https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122281
Benjamin Schulz <schulz.benjamin at googlemail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #62741|0 |1
is obsolete| |
Attachment #62742|0 |1
is obsolete| |
--- Comment #15 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 62745
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62745&action=edit
testprograms.tar.gz
I have now updated my file in to clean the memory up correctly, which in this
test, I did not bother as that is not the problem that triggers the libgomp
error...
I now have done this homework, but the libgomp error remains, at least on my
card.
If i remove this line,
Aspan.device_data_upload(true);
Bspan.device_data_upload(true);
Cspan.device_data_alloc(true);
then, the constructor of datablockcontainer does the zero block identification
on the host and the sparse multiplication offloads all the fields with mapping
macros, and everything works fine.
However, if i call the above line, then,
the dpdata pointer of the block field of datablockcontainer is copied to
another field, called former_host_ptr, and the dpdata field is allocated on
device, and the content of former_host_ptr is copied there with
omp_target_alloc.
dL.devptr_former_hostptr=dL.dpdata;
dL.dpdata=alloc_device_ptr(dL.dpdatalength,devicenum);
dL.devptr_devicenum=devicenum;
dL.dpdata_is_devptr=true;
omp_target_memcpy(dL.dpdata,dL.devptr_former_hostptr,sizeof(T)*
dL.dpdatalength,0,0,dL.devptr_devicenum, omp_get_initial_device());
However, the strides and extents, are still assumed to be mapped by mapping
macros.
What then happens in the constructor
BlockedDataView<double> Ablocks1(Aspan, block_shape,true);
BlockedDataView<double> Bblocks2(Bspan, block_shape2,true);
is that a function build_blocks_rank2 is called:
it gets dpdata as a device pointer and it allocates
pooled_offsets_flat,pooled_offsets_starts by omp_target_alloc.
and then it uses
is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts)
in an OpenMP target teams construct...
As explained my first post..
since pd is a device pointer, it should be allowed to do things like in
DataBlockContainer in line 252:
#pragma omp target teams distribute map(tofrom:count) shared(count)
is_device_ptr(pd,pooled_offsets_flat,pooled_offsets_starts)
device(dblock.devptr_devicenum)
for (size_t bi = 0; bi < nblocks_row; ++bi)
{
#pragma omp parallel for shared(count)
for (size_t bj = 0; bj < nblocks_col; ++bj)
{
const size_t row_off = bi * block_rows;
const size_t diff1 = ext0 - row_off;
const size_t tile_rows = (block_rows < diff1) ? block_rows
: diff1;
bool keep = true;
const size_t col_off = bj * block_cols;
const size_t diff2 = ext1 - col_off;
const size_t tile_cols = (block_cols < diff2) ? block_cols
: diff2;
if (remove_zeroblocks)
{
keep = false;
for (size_t i = 0; i < tile_rows && !keep; ++i)
for (size_t j = 0; j < tile_cols && !keep; ++j)
if (pd[(row_off + i) * str0 + (col_off + j)
*str1] != T(0))
{
keep = true;
goto outofloop3;
}
}
outofloop3:
if (keep)
{
size_t slot;
#pragma omp atomic capture
slot = count++;
const size_t pos = slot * 2;
pooled_offsets_starts[slot] = pos;
pooled_offsets_flat[pos] = row_off;
pooled_offsets_flat[pos+1] = col_off;
}
}
}
If I remove:
size_t slot;
#pragma omp atomic capture
slot = count++;
size_t pos = slot * 2;
pooled_offsets_starts[slot] = pos;
pooled_offsets_flat[pos] = row_off;
pooled_offsets_flat[pos+1] = col_off;
in line 288 of datablockcontainer.h the code works correctly.
But I don't know why that should not be allowed.
I allocate the memory correctly with omp_target_alloc in line 235:
pooled_offsets_flat = offsets_starts_is_devptr
? (size_t*)omp_target_alloc(sizeof(size_t) * 2 *
maxblocks, devnum)
: new size_t[2 * maxblocks];
pooled_offsets_starts = offsets_starts_is_devptr
? (size_t*)omp_target_alloc(sizeof(size_t) *
(maxblocks + 1),devnum)
: new size_t[maxblocks + 1];
The size is correct, and omt_target_alloc is called, I've checked that.
Also, in the mathdemonstrations.cpp, it seems to have problems when a struct
with containing fields is mapped, and several of its fields which are arrays
are mapped individually, and one of its field is not mapped but allocated by
omp target alloc...
The versions of cuda and gcc which I had on my system in May-July worked in
this circumstances.
Apparently, with the new nvidia driver not anymore with my card...
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Should not happe with this code, I think...