https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272
Bug ID: 123272
Summary: [nvptx] miscompilation in matrix multiplication with
#pragma omp target teams distribute parallel for collapse(2)
when class members use templates
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---
Created attachment 63128
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63128&action=edit
main.cpp
Attached is a test case with the following classes:
#include <cstddef>   // size_t

template<typename T>
class DataBlock1
{
public:
    size_t* dpextents;    // extents of the matrix
    size_t* dpstrides;    // strides of the matrix
    T* dpdata;            // flat data buffer
    size_t dpdatalength;  // number of elements in dpdata
    DataBlock1(size_t *ext, size_t *str, T *dat, size_t datlength):
        dpextents(ext), dpstrides(str), dpdata(dat), dpdatalength(datlength) {}
};
class DataBlock2
{
public:
    size_t* dpextents;
    size_t* dpstrides;
    double* dpdata;       // same layout as DataBlock1<double>
    size_t dpdatalength;
    DataBlock2(size_t *ext, size_t *str, double* dat, size_t datlength):
        dpextents(ext), dpstrides(str), dpdata(dat), dpdatalength(datlength) {}
};
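For illustration, the two blocks are presumably instantiated along these lines (the variable names here are assumptions; the actual setup is in the attached main.cpp):

    // Illustrative only -- extents/strides/data buffers are set up in main.cpp.
    DataBlock1<double> A(extentsA, stridesA, dataA, n * n);  // templated, T = double
    DataBlock2         B(extentsB, stridesB, dataB, n * n);  // non-templated, dpdata is double*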
The test case then performs two matrix multiplications, each with a
#pragma omp target teams distribute parallel for collapse(2)
construct in front of the outer two loops.
One multiplication uses the templated DataBlock1 class, with the template
parameter T, i.e. the type of the dpdata member, set to double.
The other uses the non-templated DataBlock2 class, whose dpdata member is a
plain double*.
A third multiplication is then done single-threaded on the host with the
templated class, again with T set to double.
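A minimal sketch of what the offloaded multiplication presumably looks like, assuming the DataBlock1 definition above; the exact mapping clauses, extents and stride handling are in the attached main.cpp, and the names matmul_offload, n, pa, pb, pc are illustrative only:

    template<typename T>
    void matmul_offload(const DataBlock1<T>& A, const DataBlock1<T>& B,
                        DataBlock1<T>& C, size_t n)
    {
        T* pa = A.dpdata;
        T* pb = B.dpdata;
        T* pc = C.dpdata;
        // Offload the outer two loops; collapse(2) turns the i/j nest into one
        // parallel iteration space distributed over teams and threads.
        #pragma omp target teams distribute parallel for collapse(2) \
            map(to: pa[0:n*n], pb[0:n*n]) map(tofrom: pc[0:n*n])
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
            {
                T sum = T{};
                for (size_t k = 0; k < n; ++k)
                    sum += pa[i*n + k] * pb[k*n + j];
                pc[i*n + j] = sum;
            }
    }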
If compiled with clang and -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda, the
three multiplications will agree.
If compiled with gcc-16 (current git) with nvptx offloading and
-fopenmp -foffload=nvptx-none -fno-stack-protector
the multiplication with the templated class on the GPU disagrees with the
single-threaded host result.
If, on the other hand, we compile with gcc-16 (current git) for nvptx and
-fopenmp -foffload=nvptx-none -fno-stack-protector -O1
then the results agree again and are correct.
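For reference, the reproduction presumably boils down to invocations roughly like these (the compiler binary name and output file are assumptions; main.cpp is the attachment):

    g++ -fopenmp -foffload=nvptx-none -fno-stack-protector main.cpp -o matmul        (wrong GPU results)
    g++ -fopenmp -foffload=nvptx-none -fno-stack-protector -O1 main.cpp -o matmul    (results agree)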
Before the multiplication, each function prints whether it uses the templated
class, and when the offload and the computation begin and end.
Since the template parameter T is only used for the dpdata field and is set to
double, which is the same type dpdata has in the non-templated class, the code
generated for the data mapping and for the loops in the two multiplication
functions should be exactly the same with and without the template.
So it should be possible to check what goes wrong here.
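In other words (the function names here are assumptions; the real functions are in main.cpp), these two overloads should lower to the same offloaded loop nest, since DataBlock1<double>::dpdata and DataBlock2::dpdata are both double*:

    // Only the parameter types differ; T is already fixed to double.
    void gpu_matmul(DataBlock1<double>& A, DataBlock1<double>& B, DataBlock1<double>& C);  // wrong results at -O0
    void gpu_matmul(DataBlock2& A, DataBlock2& B, DataBlock2& C);                          // correct results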
I observed the miscompilation on an old GTX 1660 Super Ti from 2018 (sm_75) as
well as on a brand-new RTX 5060 Ti (sm_120).
I observed the problem with nvidia-cuda-toolkit 12.9.1 as well as with
nvidia-cuda-toolkit 13, and with different nvidia-driver versions that use
cuda-12 and cuda-13, but only with gcc. The problem never showed up when the
demonstrator was compiled with clang.
Note that this bug does not lead to an ICE, an access violation or a crash, but
only to wrong results of a rather simple computation. Since only some numbers
are incorrect, the bug is particularly problematic: GPUs are usually used for
gigabyte-sized matrices, which cannot easily be checked manually for a few
wrong numbers. The wrong numbers appear to be random, which points to a memory
issue.
The data is mapped to the GPU and released in line with the OpenMP standard.
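A sketch of the mapping pattern described above (the exact clauses are in the attached main.cpp; A and C stand for input and result blocks as in the earlier sketch, and n and len are assumptions):

    size_t len = n * n;
    double* pa = A.dpdata;
    double* pc = C.dpdata;
    #pragma omp target enter data map(to: pa[0:len], pc[0:len])   // map to the device
    // ... offloaded multiplications write pc on the device ...
    #pragma omp target exit data map(from: pc[0:len]) map(release: pa[0:len])  // copy back / release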
compute-sanitizer --tool memcheck reports zero errors when the test case is
compiled with clang.