https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120679
            Bug ID: 120679
           Summary: nvptx stl support for automatic mapping is slow and not
                    asynchronous despite cuda supporting that. also #pragma
                    omp requires unified_shared_memory is extremely slow
                    compared to unified_address
           Product: gcc
           Version: 15.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
  Target Milestone: ---

Created attachment 61650
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61650&action=edit
benchmark for working with arrays/stl vectors, mapping, unified_address vs.
unified_shared_memory

Hi there,

I have an RTX 5060 GPU with the following features:

  Unified Addressing:        Yes
  Managed Memory:            Yes
  Concurrent Managed Memory: Yes
  Preemption Supported:      Yes
  Cooperative Launch:        Yes
  Cluster Launch:            Yes
  Unified Function Pointers: Yes
  Unified Memory:            HMM
  Memory Models Flags:       -gpu=mem:separate, -gpu=mem:managed, -gpu=mem:unified
  Default Target:            cc120

It is often mentioned on NVIDIA's forums that one can use heterogeneous memory
management (HMM) to let the driver map STL vectors:
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

With GCC 15.1 I discovered that one can apparently write

  #pragma omp target teams distribute parallel for simd
  for (size_t i = 1; i < elements; i++) {
      v3[i] = v1[i] + v2[i];
  }

over raw pointers without any manual mapping pragmas, and one may even loop
over STL vectors v1, v2, v3; the copying is then done automatically and the
loop runs on the GPU.

I have now written a benchmark, which is attached. It fills two STL vectors on
the CPU, starts a CUDA kernel with a dummy loop, and then does one run with
manual mappings and one without (i.e. with automatic mapping), looping once
over the STL vectors and once over the pointers to their underlying arrays.
The loop adds the two vectors v1 and v2 and stores the result in a third
vector v3.

I compiled with -O3 -fopenmp -foffload=nvptx-none -fno-stack-protector
-std=c++23. The results are:

  on GPU with mapping with pointers:          52.0362
  on GPU without mapping with vectors:        408.225
  on GPU without mapping with pointers:       410.298
  on GPU already mapped vector with pointers: 1.03424
  on CPU with pointers:                       13.1285
  on CPU with vectors:                        17.7788

This shows that once the vector is mapped to the GPU, the computation is
roughly 10x faster than on my CPU. But it also shows that manually issuing
#pragma omp target enter data / exit data to map the array to and from the
device is much faster than letting GCC do the mapping automatically.

What I want to argue here is that, especially when the mapping is done
automatically, there is room for more optimization. For example, one could
copy the array asynchronously while the loop executes. Say we copy the array
in two halves: when the first half has arrived, start the loop on that half,
and during that computation asynchronously upload the second half; by the time
the loop over the first half finishes, the second half is already on the
device, and so on (see the sketch below). That would probably speed things up
considerably. CUDA has the ability to do asynchronous copies, especially with
shared memory:
https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/?_gl=1*1plcfms*_gcl_au*MjExOTUxNjc5LjE3MTE1MjA0ODA.
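To illustrate what I mean, here is a rough hand-written sketch of such a
blocked, pipelined upload using depend/nowait. This is not the attached
benchmark; the names (pipelined_add, NBLOCKS, a, b, c) are made up for the
example, and it is only meant to show the scheme that the automatic mapping
could in principle apply:

  #include <algorithm>
  #include <cstddef>

  // Sketch: split the arrays into NBLOCKS chunks; upload, compute and
  // download each chunk asynchronously, so that the kernel for chunk k can
  // run while later chunks are still in flight.
  void pipelined_add(const double *v1, const double *v2, double *v3,
                     std::size_t elements)
  {
      constexpr std::size_t NBLOCKS = 4;
      const std::size_t blk = (elements + NBLOCKS - 1) / NBLOCKS;

      for (std::size_t k = 0; k < NBLOCKS; ++k) {
          const std::size_t off = k * blk;
          if (off >= elements)
              break;
          const std::size_t len = std::min(blk, elements - off);
          const double *a = v1 + off;   // this chunk's inputs
          const double *b = v2 + off;
          double *c = v3 + off;         // this chunk's output

          // Asynchronously upload the chunk's inputs, allocate its output.
          #pragma omp target enter data map(to: a[0:len], b[0:len]) \
              map(alloc: c[0:len]) depend(out: c[0]) nowait

          // The kernel for chunk k only depends on chunk k's transfer.
          #pragma omp target teams distribute parallel for simd \
              depend(inout: c[0]) nowait
          for (std::size_t i = 0; i < len; ++i)
              c[i] = a[i] + b[i];

          // Asynchronously download the chunk's result, release the inputs.
          #pragma omp target exit data map(from: c[0:len]) \
              map(release: a[0:len], b[0:len]) depend(in: c[0]) nowait
      }
      #pragma omp taskwait   // wait until every chunk is computed and downloaded
  }

The point is that each chunk's kernel only waits for that chunk's upload, so
transfers and computation could overlap if the runtime schedules the target
tasks asynchronously.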
When I use a better manual mapping strategy, where v1 and v2 are uploaded, v3
is only allocated on the device, and only v3 is downloaded after the
computation, I get:

  on GPU with mapping with pointers:          26.4436
  on GPU without mapping with vectors:        413.701
  on GPU without mapping with pointers:       399.439
  on GPU already mapped vector with pointers: 1.04433
  on CPU with pointers:                       11.5262
  on CPU with vectors:                        11.6417

Using nowait on the first two mappings and a taskwait before the loop (see the
sketch at the end of this report) gives:

  on GPU with mapping with pointers:          26.2541
  on GPU without mapping with vectors:        417.334
  on GPU without mapping with pointers:       409.019
  on GPU already mapped vector with pointers: 1.04093
  on CPU with pointers:                       12.6094
  on CPU with vectors:                        14.3938

So unfortunately no speedup from the asynchronous mapping. Note that, as far
as I know, OpenMP supports depend clauses on target constructs, which should
allow a blocked asynchronous upload, so one could even arrange for the loop to
start on some blocks before all of them have been uploaded:
https://www.openmp.org/spec-html/5.0/openmpsu99.html
https://www.openmp.org/spec-html/5.0/openmpsu58.html

Then, if I add #pragma omp requires unified_shared_memory to the code, the
results are strange (way too slow):

  on GPU with mapping with pointers:          405.639
  on GPU without mapping with vectors:        410.369
  on GPU without mapping with pointers:       413.107
  on GPU already mapped vector with pointers: 412.337
  on CPU with pointers:                       13.7056
  on CPU with vectors:                        12.6914

It is not clear why it should take so long; my GPU supports HMM.

With #pragma omp requires unified_address and naive mapping, one gets more or
less the numbers from before:

  on GPU with mapping with pointers:          51.8929
  on GPU without mapping with vectors:        417.563
  on GPU without mapping with pointers:       407.834
  on GPU already mapped vector with pointers: 1.03666
  on CPU with pointers:                       12.3097
  on CPU with vectors:                        11.6552

So I guess unified_address is what GCC uses automatically. But it apparently
does not use asynchronous mapping strategies, so one still needs to map
manually, because the automatic process is slow. And unfortunately, doing the
mapping asynchronously by hand did not lead to a speed improvement here
either.
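For reference, the manual mapping strategy described above (upload v1 and v2,
only allocate v3, download only v3, with nowait on the uploads and a taskwait
before the kernel) looks roughly like this; this is a simplified sketch, not a
verbatim copy of the attached benchmark:

  #include <cstddef>
  #include <vector>

  void add_mapped(const std::vector<double> &v1, const std::vector<double> &v2,
                  std::vector<double> &v3)
  {
      const double *p1 = v1.data();
      const double *p2 = v2.data();
      double *p3 = v3.data();
      const std::size_t n = v3.size();

      // Start both uploads asynchronously; v3 only needs device storage.
      #pragma omp target enter data map(to: p1[0:n]) nowait
      #pragma omp target enter data map(to: p2[0:n]) nowait
      #pragma omp target enter data map(alloc: p3[0:n])
      #pragma omp taskwait              // make sure both uploads have finished

      #pragma omp target teams distribute parallel for simd
      for (std::size_t i = 0; i < n; ++i)
          p3[i] = p1[i] + p2[i];

      // Only the result needs to come back; the inputs are just released.
      #pragma omp target exit data map(from: p3[0:n]) \
          map(release: p1[0:n], p2[0:n])
  }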