[Bug libgomp/122280] target teams distribute parallel for collapse(2) yields different results in a matmul than separate loops (one with omp target teams distribute the second with omp parallel for) on nvptx target. Clang compiles the code correctly

burnus at gcc dot gnu.org via Gcc-bugs Thu, 30 Oct 2025 02:46:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280


Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> ---
I have now tried the following:

* Unpacked the first attachment (archive.tar.gz, attachment 62555)
* Compiled it with
    mpicxx -std=c++23 -fopenmp -I . mathdemonstrations.cpp

To aid debugging, I changed GPU_ONLY to AUTO in line 154:

  Math_Functions_Policy p1(Math_Functions_Policy::AUTO);

I also wrapped everything after the following 'C.printtensor' in '#if 0'
(and added a closing '}').

Result: When running it once manually, I got some 'wrong' results for the
host but not for the GPU. I then ran:

for ((I=1; $I<=10; I++)); do OMP_TARGET_OFFLOAD=disabled GOMP_DEBUG=1 ./a.out | tail -n 12 > dis-$I; done

for ((I=1; $I<=10; I++)); do OMP_TARGET_OFFLOAD=mandatory GOMP_DEBUG=1 ./a.out | tail -n 12 > mand-$I; done

And the debug output shows that it was indeed offloading.

* * *

Comparing those results with the clang output of comment 0 showed that they
are the same.

* * *

That said, when I first ran it manually, I got some differences for
the host fallback (but not for the GPU output), which did not reproduce when
running it in the loops above.

That's an x86-64 system with an Nvidia sm_86 GPU: RTX A1000 6GB Laptop
and the distro compiler: 15.2.1 20251006.

I also tried it with -O3 and the current git version of GCC,
and also with -foffload-options=nvptx-none=-march=sm_80.

That's with NVIDIA-SMI 580.95.05, Driver Version: 580.95.05, CUDA Version:
13.0.

* * *

I wonder why I once got different results on the host, and I wonder why it
fails for the bug reporter. I hate Schroedinger bugs/heisenbugs!
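
* * *

For reference, the two loop forms named in the bug title can be sketched
roughly as follows. This is only a minimal sketch and not the code from the
attachment: the function names (matmul_collapsed, matmul_split, max_abs_diff),
the row-major n x n layout and the map clauses are assumptions made for
illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical names; NOT taken from the reporter's archive.
// Both functions compute C = A * B for row-major n x n matrices.

// Variant 1: one combined construct with the i and j loops collapsed.
void matmul_collapsed(const double* A, const double* B, double* C, int n)
{
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      double sum = 0.0;
      for (int k = 0; k < n; ++k)
        sum += A[i*n + k] * B[k*n + j];
      C[i*n + j] = sum;
    }
}

// Variant 2: 'target teams distribute' on the i loop and a separate
// 'parallel for' on the j loop.
void matmul_split(const double* A, const double* B, double* C, int n)
{
  #pragma omp target teams distribute \
      map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
  for (int i = 0; i < n; ++i) {
    #pragma omp parallel for
    for (int j = 0; j < n; ++j) {
      double sum = 0.0;
      for (int k = 0; k < n; ++k)
        sum += A[i*n + k] * B[k*n + j];
      C[i*n + j] = sum;
    }
  }
}

// Largest element-wise absolute difference between two results.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y)
{
  assert(x.size() == y.size());
  double d = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    d = std::max(d, std::fabs(x[i] - y[i]));
  return d;
}
```

Both variants accumulate over k in the same order, so one would expect
max_abs_diff of their outputs to be zero. Without -fopenmp the pragmas are
ignored and both functions run serially on the host, which gives a simple
baseline to compare against.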
