https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123750
Bug ID: 123750
Summary: ICE on valid code [openmp]: Segmentation fault for
#pragma omp target teams distribute parallel for
collapse(2) map(tofrom:count)
shared(pooled_offsets_starts,pooled_offsets_flat)
is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_st
arts) device(devnum)
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---
Created attachment 63447
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63447&action=edit
sourcecode.tar.gz
The attached file contains source code for a library with many linear algebra
functions.
It also contains a CMakeLists.txt which creates some test applications for host
and nvptx GPUs using OpenMP, when built with cmake . and then make.
If unchanged, the CMakeLists.txt will use /usr/bin/g++-16, but one can comment
out the line for the compiler and activate the three lines for compilation with
/usr/lib/llvm/21/bin/clang++
With clang, the test programs sparsetests and mathdemonstrations compile and
run without errors, also without errors under cuda-sanitizer.
With gcc, one gets:
In file included from
/home/benni/projects/arraylibrary/openmp/sparsetests.cpp:3:
/home/benni/projects/arraylibrary/openmp/datablockcontainer.h: In instantiation
of 'void BlockedDataView<T>::build_blocks_rank2(size_t, size_t, bool) [with T =
double; size_t = long unsigned int]':
required from 'BlockedDataView<T>::BlockedDataView(const DataBlock<T>&, const
size_t*, bool) [with T = double; size_t = long unsigned int]'
/home/benni/projects/arraylibrary/openmp/datablockcontainer.h:84:13:
84 | build_blocks_rank2(bshape[0], bshape[1],
remove_zeroblocks);
| ^~~~~~~~~~~~~~~~~~
required from here
/home/benni/projects/arraylibrary/openmp/sparsetests.cpp:76:57:
76 | BlockedDataView<double> Ablocks(Ad, block_shape,true);
|
^
/home/benni/projects/arraylibrary/openmp/datablockcontainer.h:251:38:
internal compiler error: Segmentation fault
251 | #pragma omp target teams distribute parallel for
collapse(2) map(tofrom:count) shared(pooled_offsets_starts,pooled_offsets_flat)
is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts) device(devnum)
|
^~~~~~~~~~
/usr/libexec/gcc/x86_64-pc-linux-gnu/16/cc1plus -quiet -I
/home/benni/projects/arraylibrary/openmp/datablocktests -I
/home/benni/projects/arraylibrary/openmp -I
/home/benni/projects/arraylibrary/openmp/mdspantests -I
/home/benni/projects/arraylibrary/openmp/mathdemonstrations -I
/home/benni/projects/arraylibrary/openmp/arraytest_mpi -I
/home/benni/projects/arraylibrary/openmp/sparsetests -MD
CMakeFiles/sparsetests.dir/sparsetests.cpp.d -MF
CMakeFiles/sparsetests.dir/sparsetests.cpp.o.d -MT
CMakeFiles/sparsetests.dir/sparsetests.cpp.o -D_GNU_SOURCE -D_REENTRANT
/home/benni/projects/arraylibrary/openmp/sparsetests.cpp -quiet -dumpdir
CMakeFiles/sparsetests.dir/ -dumpbase sparsetests.cpp.cpp -dumpbase-ext .cpp
-mtune=generic -march=x86-64 -mtls-dialect=gnu2 -Wall -std=gnu++23 -fopenmp
-fno-stack-protector -fdump-tree-all -o /tmp/cchxFaI8.s
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
See <https://bugs.gentoo.org/> for instructions.
The ICE happens in datablockcontainer.h at line 251, at the statement:
#pragma omp target teams distribute parallel for collapse(2) map(tofrom:count) shared(pooled_offsets_starts,pooled_offsets_flat) is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts) device(devnum)
for the code:
#pragma omp target teams distribute parallel for collapse(2) map(tofrom:count) shared(pooled_offsets_starts,pooled_offsets_flat) is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts) device(devnum)
for (size_t bi = 0; bi < nblocks_row; ++bi)
{
for (size_t bj = 0; bj < nblocks_col; ++bj)
{
const size_t row_off = bi * block_rows;
const size_t diff1 = ext0 - row_off;
const size_t tile_rows = (block_rows < diff1) ? block_rows : diff1;
const size_t col_off = bj * block_cols;
const size_t diff2 = ext1 - col_off;
const size_t tile_cols = (block_cols < diff2) ? block_cols : diff2;
bool keep = true;
if (remove_zeroblocks)
{
keep = false;
for (size_t i = 0; i < tile_rows && !keep; ++i)
{
for (size_t j = 0; j < tile_cols && !keep; ++j)
{
if (pd[(row_off + i) * str0 + (col_off + j) * str1] != T(0))
{
keep = true;
goto outofloop3;
}
}
}
}
outofloop3:
if (keep)
{
size_t slot;
#pragma omp atomic capture
slot = count++;
{
size_t pos = slot * 2;
pooled_offsets_starts[slot] = pos;
pooled_offsets_flat[pos] = row_off;
pooled_offsets_flat[pos+1] = col_off;
}
}
}
}
I can avoid the ICE by removing
shared(pooled_offsets_starts,pooled_offsets_flat)
from line 251 and using
#pragma omp target teams distribute parallel for collapse(2) map(tofrom:count) is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts) device(devnum)
Then sparsetests compiles, but at runtime one gets:
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Even this cuCtxSynchronize error disappears if I replace line 251 with
#pragma omp target map(tofrom:count) is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts) device(devnum)
But then the loops are of course no longer parallelized. For whatever reason,
the code fails to execute correctly once the loops in the code above are
parallelized.
From the code, however, the loops are parallelizable: the atomic ensures that
there is no data race. This is also evident from clang being able to compile
and run the code without errors.
One should note that the code is a member function of a templated class, and
it declares thread-private variables of template type.
There are problems with the scoping of combined OpenMP constructs and
templated types in gcc 14/15/16, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123597, but I tried to work
around them by putting everything into separate braces...
The CMakeLists.txt creates another program called mathdemonstrations. It fails
at runtime with the same CUDA errors when it starts to execute an LU
decomposition:
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
It will, of course, also run without failure when compiled with clang.
The test programs created by the CMakeLists.txt run on the
omp_get_default_device(). The problems reported above were observed on an
NVIDIA RTX 5060 Ti card.
If you need a different device than omp_get_default_device(), you need to
change lines 96-98 in sparsetests.cpp:
cout<<"did the offload of A work?: "<<Aspan.device_data_upload(true)<<endl;
cout<<"did the offload of B work?: "<<Bspan.device_data_upload(true)<<endl;
cout<<"did the offload of C work?: "<<Cspan.device_data_alloc(true)<<endl;
into
cout<<"did the offload of A work?: "<<Aspan.device_data_upload(false,mydevicenumber)<<endl;
cout<<"did the offload of B work?: "<<Bspan.device_data_upload(false,mydevicenumber)<<endl;
cout<<"did the offload of C work?: "<<Cspan.device_data_alloc(false,mydevicenumber)<<endl;
where mydevicenumber is an int set to the number of the device on which you
want the code to be offloaded and run...
In mathdemonstrations.cpp, there are various objects of the classes
Math_Functions_Policy and Math_MPI_RecursiveMultiplication_Policy.
These have a member field called devnum, which has to be set to the device on
which you want to run the code.
I am sorry I cannot provide a shorter test case; whenever I simplify further,
the code seems to work.
The problems seem to appear when combining templated classes, templated types
within parallelized OpenMP loops, shared variables, device pointers, and
combined OpenMP constructs on nvptx target regions...