[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-02-02 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

Tom de Vries  changed:

   What|Removed |Added

   Keywords||testsuite-fail
 Resolution|--- |FIXED
   Target Milestone|--- |12.0
 Status|UNCONFIRMED |RESOLVED

--- Comment #16 from Tom de Vries  ---
All these fails should be fixed on current trunk.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-26 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #15 from Tom de Vries  ---
(In reply to Tom de Vries from comment #14)
> An observation when playing around with vector-length-128-4.c:

Another observation:
...
  $L11:
ld.u64 %r108,[%r109];
st.u64 [%r112],%r108;
setp.lt.u32 %r111,%r110,3;
add.u32 %r110,%r110,1;
add.u64 %r109,%r109,8;
add.u64 %r112,%r112,8;
@ %r111 bra.uni $L11;
...

The bra.uni in the broadcast loop is incorrect, it's used in a vector-neutered
block.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-26 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #14 from Tom de Vries  ---
An observation when playing around with vector-length-128-4.c: there are two
ways in which I can make the example pass.

1. add barrier.sync.aligned 0 or membar.cta after first broad-cast receive

2. unroll loop in first broad-cast send.

At first glance, it doesn't look entirely trivial though to implement either.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-25 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #13 from Tom de Vries  ---
(In reply to Tom de Vries from comment #10)
> [ FTR, T400, driver 470.94 ]
> 
> Interestingly, changing the default ptx version to 6.3 makes the minimal
> test-case pass, as well as the full parallel-dims.c
> 
> The only code changes are shfl -> shfl.sync and vote -> vote.sync.
> 

It seems another change is required.

Starting with 6.0, bar.sync maps onto barrier.sync.aligned, where the aligned
means that "all threads in CTA will execute the same barrier instruction. In
conditionally executed code, an aligned barrier instruction should only be used
if it is known that all threads in CTA evaluate the condition identically,
otherwise behavior is undefined."

It's not fully clear what is meant with "the same barrier instruction" or
"condition", but in the case of vector_length > 32, we use:
...
bar.sync %r67,64;
...
where %r67 is a barrier number, 1 for worker 0 and 2 for worker 1 in case of 2
workers.  It may well be that it's invalid to use bar.sync for this, and we
should use barrier.sync instead.

But then there's an isa note:
...
Note: For .target sm_6x or below,
1. barrier instruction without .aligned modifier is equivalent to .aligned
variant and has the same restrictions as of .aligned variant.
...
which seems to imply that we get back barrier.sync.aligned behaviour for sm_6x
and earlier, which would again break vector_length > 32.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-25 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #12 from Tom de Vries  ---
Created attachment 52285
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52285=edit
Cuda reproducer non-32 vector length

[ On T400, driver version 470.94 ]

NVCC SASS:
...
$ ./do.sh 
NVCC SASS, ptxas=-O0:
+ /home/vries/cuda/11.5.1/bin/nvcc vector-length-64.cu -arch=compute_75
-code=sm_75 -Xptxas -O0
+ ./a.out
NVCC SASS, ptxas=-O1:
+ /home/vries/cuda/11.5.1/bin/nvcc vector-length-64.cu -arch=compute_75
-code=sm_75 -Xptxas -O1
+ ./a.out

...

Driver SASS:
...
$ ./do.sh 
DRIVER SASS, ptxas=-O0:
+ /home/vries/cuda/11.4.3/bin/nvcc vector-length-64.cu -arch=compute_75 -Xptxas
-O0
+ ./a.out
DRIVER SASS, ptxas=-O1:
+ /home/vries/cuda/11.4.3/bin/nvcc vector-length-64.cu -arch=compute_75 -Xptxas
-O1
+ ./a.out

...

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-25 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #11 from Tom de Vries  ---
(In reply to Tom de Vries from comment #10)
> Rerunning the entire testsuite though shows that the non-32-vector-length
> test-cases are still failing.

Minimal example:
...
int
main (void)
{
#pragma acc parallel num_workers (2) vector_length (64)
  {
#pragma acc loop worker
for (unsigned int i = 0; i < 2; i++)
#pragma acc loop vector
  for (unsigned int j = 0; j < 64; j++)
;
  }

  return 0;
}
...

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-24 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #10 from Tom de Vries  ---
[ FTR, T400, driver 470.94 ]

Interestingly, changing the default ptx version to 6.3 makes the minimal
test-case pass, as well as the full parallel-dims.c

The only code changes are shfl -> shfl.sync and vote -> vote.sync.

Both of these require sm_30, so from that perspective we won't leave any
architectures behind.

OTOH, this may leave behind:
- some older drivers
- some older CUDAs (if ptxas is used for ptx verification in the
nvptx-none-as).

Rerunning the entire testsuite though shows that the non-32-vector-length
test-cases are still failing.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-24 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #9 from Tom de Vries  ---
Created attachment 52273
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52273=edit
New cuda reproducer

$ ./do.sh 
DRIVER SASS, ptxas=-O0:
+ /home/vries/cuda/11.4.3/bin/nvcc vector-max.cu -Wno-deprecated-gpu-targets
-arch=compute_35 -Xptxas -O0
+ ./a.out
a[0]: 31
DRIVER SASS, ptxas=-O1:
+ /home/vries/cuda/11.4.3/bin/nvcc vector-max.cu -Wno-deprecated-gpu-targets
-arch=compute_35 -Xptxas -O1
+ ./a.out
a[0]: 0
./do.sh: line 34: 27353 Aborted (core dumped) ./a.out

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2022-01-24 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #8 from Tom de Vries  ---
New minimal oacc example:
...
int
main (void)
{
  int vectors_max = -1;

#pragma acc parallel\
  num_gangs (1) num_workers (1) \
  copy (vectors_max)
  {
for (int i = 0; i < 2; i++)
  for (int j = 0; j < 2; j++)
#pragma acc loop vector reduction (max: vectors_max)
for (int k = 0; k < 32; k++)
  vectors_max = k;
  }

  if (vectors_max != 31)
__builtin_abort ();

  return 0;
}
...

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-12-09 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #7 from Tom de Vries  ---
(In reply to Tom de Vries from comment #6)
> (In reply to Tom de Vries from comment #5)
> > FIled https://developer.nvidia.com/nvidia_bug/3299227
> 
> Nvidia reported it will be fixed in the next major cuda release. I've asked
> about driver fixes.

I've tested 470.86, and can confirm that the failure as reported to nvidia has
disappeared.

However, the example from comment 2 still passes with GOMP_NVPTX_JIT=-O0, and
starts failing at GOMP_NVPTX_JIT=-O1.

So it looks like we'll need to file another bug-report at nvidia.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-04-27 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #6 from Tom de Vries  ---
(In reply to Tom de Vries from comment #5)
> FIled https://developer.nvidia.com/nvidia_bug/3299227

Nvidia reported it will be fixed in the next major cuda release. I've asked
about driver fixes.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-04-24 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #5 from Tom de Vries  ---
FIled https://developer.nvidia.com/nvidia_bug/3299227

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-04-23 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #4 from Tom de Vries  ---
Created attachment 50662
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50662=edit
Updated cuda reproducer

Slimmed down further, eliminated gang/worker reduction parts.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-04-23 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #3 from Tom de Vries  ---
Created attachment 50660
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50660=edit
Cuda reproducer

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-04-23 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #2 from Tom de Vries  ---
Minimal example:
...
$ cat libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
int
main (void)
{
  int vectors_max = -1;
#pragma acc parallel \
  num_gangs (1) \
  num_workers (1) \
  vector_length (32) \
  copy (vectors_max)
{
#pragma acc loop gang reduction (max: vectors_max)
  for (int i = 0; i < 2; i++)
#pragma acc loop worker reduction (max: vectors_max)
for (int j = 0; j < 2; j++)
#pragma acc loop vector reduction (max: vectors_max)
  for (int k = 0; k < 32; k++)
vectors_max = k;
}

  if (vectors_max != 31)
__builtin_abort ();

  return 0;
}
...

Passes with GOMP_NVPTX_JIT=-O0, starts failing at GOMP_NVPTX_JIT=-O1.

[Bug target/99932] OpenACC/nvptx offloading execution regressions starting with CUDA 11.2-era Nvidia Driver 460.27.04

2021-04-22 Thread vries at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932

--- Comment #1 from Tom de Vries  ---
(In reply to Thomas Schwinge from comment #0)
> We're seeing OpenACC/nvptx offloading execution regressions (including a lot
> of timeouts) starting with CUDA 11.2-era Nvidia Driver 460.27.04.  Confirmed
> with: CUDA 11.2-era 460.27.04, 460.32.03, 460.39, 460.56, 460.67, and CUDA
> 11.3-era 465.19.01, across several variants of GPU hardware.
> 
> Explicitly (re-)confirmed good are older versions such as CUDA 9.1-era
> 390.12, and CUDA 11.1-era 455.38, 455.45.01.
> 
> Most of these are in the 'vector_length > 32' testcases, but also a few
> others.
> 

Confirmed, I see on ubuntu 18.04.5 with dirver version 460.67:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2 
execution test
...