https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120679
            Bug ID: 120679
           Summary: nvptx stl support for automatic mapping is slow and not
                    asynchronous despite cuda supporting that. also #pragma
                    omp requires unified_shared_memory is extremely slow
                    compared to unified_address
           Product: gcc
           Version: 15.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
  Target Milestone: ---

Created attachment 61650
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61650&action=edit
benchmark for working with arrays/stl vectors, mapping, unified_address vs.
unified_shared_memory

Hi there,

I have an RTX 5060 GPU with the following features:

  Unified Addressing:        Yes
  Managed Memory:            Yes
  Concurrent Managed Memory: Yes
  Preemption Supported:      Yes
  Cooperative Launch:        Yes
  Cluster Launch:            Yes
  Unified Function Pointers: Yes
  Unified Memory:            HMM
  Memory Models Flags:       -gpu=mem:separate, -gpu=mem:managed, -gpu=mem:unified
  Default Target:            cc120

It is often mentioned on NVIDIA's forums that one can use heterogeneous memory
management (HMM) to let the driver map STL vectors:
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

With GCC 15.1 I discovered that one can apparently write

  #pragma omp target teams distribute parallel for simd
  for (size_t i = 1; i < elements; i++) {
      v3[i] = v1[i] + v2[i];
  }

over raw pointers without any manual mapping pragmas, and one may even loop
over STL vectors v1, v2, v3; the copying is then done automatically and the
loop runs on the GPU.

I have now written a benchmark, which is attached. It fills two STL vectors on
the CPU, starts a CUDA kernel with a dummy loop, and then does one run with
manual mappings and one without (i.e. with automatic mapping), looping once
over the STL vectors and once over the pointers to their underlying arrays.
The loop adds the two vectors v1 and v2 and stores the result in a third
vector v3.

I compiled with -O3 -fopenmp -foffload=nvptx-none -fno-stack-protector
-std=c++23. The results are:

  on GPU with mapping with pointers:          52.0362
  on GPU without mapping with vectors:        408.225
  on GPU without mapping with pointers:       410.298
  on GPU already mapped vector with pointers: 1.03424
  on CPU with pointers:                       13.1285
  on CPU with vectors:                        17.7788

This shows that once the vector is mapped to the GPU, the computation is
roughly 10x faster than on my CPU. But it also shows that manually issuing
#pragma omp target enter data / exit data to map the array to and from the
device is much faster than letting GCC do the mapping automatically.

What I want to argue here is that, especially when the mapping is done
automatically, there is room for more optimization. For example, one could
copy the array asynchronously while the loop executes. Say we copy the array
in two halves: when the first half has arrived, start the loop on that half,
and during that computation asynchronously upload the second half; by the time
the loop over the first half finishes, the second half is already on the
device, and so on (see the sketch below). That would probably speed things up
considerably. CUDA has the ability to do asynchronous copies, especially with
shared memory:
https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/?_gl=1*1plcfms*_gcl_au*MjExOTUxNjc5LjE3MTE1MjA0ODA.
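To illustrate what I mean, here is a rough hand-written sketch of such a
blocked, pipelined upload using depend/nowait. This is not the attached
benchmark; the names (pipelined_add, NBLOCKS, a, b, c) are made up for the
example, and it is only meant to show the scheme that the automatic mapping
could in principle apply:

  #include <algorithm>
  #include <cstddef>

  // Sketch: split the arrays into NBLOCKS chunks; upload, compute and
  // download each chunk asynchronously, so that the kernel for chunk k can
  // run while later chunks are still in flight.
  void pipelined_add(const double *v1, const double *v2, double *v3,
                     std::size_t elements)
  {
      constexpr std::size_t NBLOCKS = 4;
      const std::size_t blk = (elements + NBLOCKS - 1) / NBLOCKS;

      for (std::size_t k = 0; k < NBLOCKS; ++k) {
          const std::size_t off = k * blk;
          if (off >= elements)
              break;
          const std::size_t len = std::min(blk, elements - off);
          const double *a = v1 + off;   // this chunk's inputs
          const double *b = v2 + off;
          double *c = v3 + off;         // this chunk's output

          // Asynchronously upload the chunk's inputs, allocate its output.
          #pragma omp target enter data map(to: a[0:len], b[0:len]) \
              map(alloc: c[0:len]) depend(out: c[0]) nowait

          // The kernel for chunk k only depends on chunk k's transfer.
          #pragma omp target teams distribute parallel for simd \
              depend(inout: c[0]) nowait
          for (std::size_t i = 0; i < len; ++i)
              c[i] = a[i] + b[i];

          // Asynchronously download the chunk's result, release the inputs.
          #pragma omp target exit data map(from: c[0:len]) \
              map(release: a[0:len], b[0:len]) depend(in: c[0]) nowait
      }
      #pragma omp taskwait   // wait until every chunk is computed and downloaded
  }

The point is that each chunk's kernel only waits for that chunk's upload, so
transfers and computation could overlap if the runtime schedules the target
tasks asynchronously.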
When I use a better manual mapping strategy, where v1 and v2 are uploaded, v3
is only allocated on the device, and only v3 is downloaded after the
computation, I get:

  on GPU with mapping with pointers:          26.4436
  on GPU without mapping with vectors:        413.701
  on GPU without mapping with pointers:       399.439
  on GPU already mapped vector with pointers: 1.04433
  on CPU with pointers:                       11.5262
  on CPU with vectors:                        11.6417

Using nowait on the first two mappings and a taskwait before the loop (see the
sketch at the end of this report) gives:

  on GPU with mapping with pointers:          26.2541
  on GPU without mapping with vectors:        417.334
  on GPU without mapping with pointers:       409.019
  on GPU already mapped vector with pointers: 1.04093
  on CPU with pointers:                       12.6094
  on CPU with vectors:                        14.3938

So unfortunately no speedup from the asynchronous mapping. Note that, as far
as I know, OpenMP supports depend clauses on target constructs, which should
allow a blocked asynchronous upload, so one could even arrange for the loop to
start on some blocks before all of them have been uploaded:
https://www.openmp.org/spec-html/5.0/openmpsu99.html
https://www.openmp.org/spec-html/5.0/openmpsu58.html

Then, if I add #pragma omp requires unified_shared_memory to the code, the
results are strange (way too slow):

  on GPU with mapping with pointers:          405.639
  on GPU without mapping with vectors:        410.369
  on GPU without mapping with pointers:       413.107
  on GPU already mapped vector with pointers: 412.337
  on CPU with pointers:                       13.7056
  on CPU with vectors:                        12.6914

It is not clear why it should take so long; my GPU supports HMM.

With #pragma omp requires unified_address and naive mapping, one gets more or
less the numbers from before:

  on GPU with mapping with pointers:          51.8929
  on GPU without mapping with vectors:        417.563
  on GPU without mapping with pointers:       407.834
  on GPU already mapped vector with pointers: 1.03666
  on CPU with pointers:                       12.3097
  on CPU with vectors:                        11.6552

So I guess unified_address is what GCC uses automatically. But it apparently
does not use asynchronous mapping strategies, so one still needs to map
manually, because the automatic process is slow. And unfortunately, doing the
mapping asynchronously by hand did not lead to a speed improvement here
either.
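For reference, the manual mapping strategy described above (upload v1 and v2,
only allocate v3, download only v3, with nowait on the uploads and a taskwait
before the kernel) looks roughly like this; this is a simplified sketch, not a
verbatim copy of the attached benchmark:

  #include <cstddef>
  #include <vector>

  void add_mapped(const std::vector<double> &v1, const std::vector<double> &v2,
                  std::vector<double> &v3)
  {
      const double *p1 = v1.data();
      const double *p2 = v2.data();
      double *p3 = v3.data();
      const std::size_t n = v3.size();

      // Start both uploads asynchronously; v3 only needs device storage.
      #pragma omp target enter data map(to: p1[0:n]) nowait
      #pragma omp target enter data map(to: p2[0:n]) nowait
      #pragma omp target enter data map(alloc: p3[0:n])
      #pragma omp taskwait              // make sure both uploads have finished

      #pragma omp target teams distribute parallel for simd
      for (std::size_t i = 0; i < n; ++i)
          p3[i] = p1[i] + p2[i];

      // Only the result needs to come back; the inputs are just released.
      #pragma omp target exit data map(from: p3[0:n]) \
          map(release: p1[0:n], p2[0:n])
  }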