Hi all,

Andrew Stubbs wrote:
On 17/10/2025 16:18, Tobias Burnus wrote:
Slightly offtopic remark:
I think it 'only' helps with performance when
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
is false - and can also be achieved by using cuMemHostRegister as
alternative. I wrote 'only' as the only two systems I know that have
this attribute set are a Grace-Hopper GH100 system and ORNL's former
Summit system.
I do not recall on which system we observed the speed-up (a couple of years ago), but it could not have been those systems as we did not have access to either at the time.  We did observe a real-world speed-up using one of the x86_64 test machines we had then (however, the PowerPC machine I tested did not demonstrate the advantage).

[Note that the PowerPC + Vega machine that we had didn't have this flag, unlike Summit.]

The speed-up makes sense, based on Nvidia's documentation for cuMemAllocHost/cuMemHostAlloc:

"The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cuMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc()."

Likewise for "normal" memory allocations followed by cuMemHostRegister: "This memory range also is added to the same tracking mechanism as cuMemHostAlloc to automatically accelerate calls to functions such as cuMemcpyHtoD(). […]"

I tried to find the quote again that links the latter to CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, but I failed.

I now assume that I misread it. Still, I assume that this feature helps here, as it avoids extra hops. Whether it helps enough to make calling cuMemHostRegister unnecessary, or whether registering still improves performance, is unclear – but it could be tested (using, e.g., a GH100 machine).
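For reference, such a test could start from something like the following – a quick sketch of mine, not part of the patch – which merely queries the attribute via the CUDA driver API:

#include <stdio.h>
#include <cuda.h>

int
main (void)
{
  CUdevice dev;
  int val = 0;
  /* Query whether the default device accesses pageable host memory via the
     host's page tables (reported true on, e.g., Grace-Hopper systems).  */
  if (cuInit (0) != CUDA_SUCCESS
      || cuDeviceGet (&dev, 0) != CUDA_SUCCESS
      || cuDeviceGetAttribute (&val,
			       CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
			       dev) != CUDA_SUCCESS)
    return 1;
  printf ("pageable memory access uses host page tables: %d\n", val);
  return 0;
}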

But independent of whether it is required for GH100 or not, it surely should help at least on all other systems that do not have this feature – and it avoids (for better or worse) the 'ulimit' for locked memory.
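To illustrate what is being compared – again just a sketch of mine with minimal error handling, assuming a CUDA context is already current – the two documented ways to obtain driver-tracked host memory are cuMemHostAlloc and a "normal" allocation followed by cuMemHostRegister:

#include <stdlib.h>
#include <cuda.h>

/* Return SIZE bytes of host memory that the CUDA driver tracks (and hence
   uses for accelerated copies), either allocated by the driver itself or
   obtained via malloc and then registered; NULL on failure.  Neither path
   is subject to the locked-memory 'ulimit' the way an mlock-based scheme
   would be.  */

static void *
get_tracked_host_mem (size_t size, int use_register)
{
  void *p;
  if (use_register)
    {
      p = malloc (size);
      if (p != NULL && cuMemHostRegister (p, size, 0) != CUDA_SUCCESS)
	{
	  free (p);
	  p = NULL;
	}
    }
  else if (cuMemHostAlloc (&p, size, 0) != CUDA_SUCCESS)
    p = NULL;
  return p;
}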

* * *

For completeness: The tests assume that either no Nvidia GPU
is available or the default device is an Nvidia GPU.
Otherwise, offload_device_nvptx will be false but
cuMemHostAlloc is still called.

However, that shouldn't be an issue most of the time as
libgomp first checks for CUDA devices - and the default
device becomes the first device found.

The 'checked first' part seems to hold as most GCC builds configure --enable-offload-targets=nvptx-none,amdgcn-amdhsa in this order (assuming both targets are enabled at all) – partly for historic reasons (nvptx support came first). However, the runtime check ('dlopen') is actually done in the order specified by this configure option.

But I have to correct my assertion that it is always in the order nvptx then gcn.
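To make the ordering concrete, here is a stand-alone sketch (not the actual libgomp code; the plugin names and the soname pattern are assumptions on my side) of trying the plugins in the configured order – the first one that loads and reports devices provides device 0, i.e. the default device:

#include <stdio.h>
#include <dlfcn.h>

int
main (void)
{
  /* Order as with --enable-offload-targets=nvptx-none,amdgcn-amdhsa.  */
  static const char *const plugins[] = { "nvptx", "gcn" };
  for (size_t i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
    {
      char soname[64];
      snprintf (soname, sizeof soname, "libgomp-plugin-%s.so.1", plugins[i]);
      void *h = dlopen (soname, RTLD_LAZY);
      printf ("%s: %s\n", soname, h != NULL ? "found" : "not found");
      if (h != NULL)
	dlclose (h);
    }
  return 0;
}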

* * *

In that case, I think we might have bigger problems trying to test on systems that have both flavours of GPU. I have not tested that configuration, but I have run the tests on both AMD and NVPTX independently.

I think the number of systems that have both an Nvidia and an AMD device, with both CUDA and ROCm's ROCr installed, is small. Additionally, such a system would need a default device other than the Nvidia one.

Only in that case will there be a failure. That seems to be a rather special case – and the only harm will be that those few pinned-memory testcases fail; the pinned-memory handling itself should work just fine.

* * *

OK, I will review realloc in both patches.

Thanks. (I admittedly forgot to recheck when looking at your "PATCH v6 2/2".)

* * *

Question: How many gomp_debug calls exist for a successful
linux_memspace_calloc that uses cuMemHostRegister?

Answer: 11 - if I have not miscounted.

[…]

The most absurd is the following one, which is always
shown:

+         gomp_debug (0, " using_device=%d, using_device_old=%d\n",
+                     using_device, using_device_old);
+         assert (using_device_old == -1
+                 /* We shouldn't have concurrently changed our mind.  */
+                 || using_device_old == using_device);

But I also find

+      for (int i = 0; i < num_devices; ++i)
+       {
+         gomp_debug (0, "  i=%d, target_id=%d\n",
+                     i, devices[i].target_id);

rather useless.

Thomas, do you have an opinion about these? I believe you added them?

The first one looks like a leftover from debugging. using_device_old is expected to be either '-1' (uninitialized) or, if some other thread ran concurrently, the same value (no device = 0/false, some device = 1/true).

Thus, the assert kind of makes sense, albeit it seems overly bulletproof – all reasonable cases where it could fail should already have failed earlier with a call to gomp_fatal.

But independent of the assert, I bet that the values of those variables can also be deduced from the remaining diagnostic output, unless that output is cut to zero.
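For illustration only – one way to trim it (a sketch, not meant as the final wording of the patch) would keep the consistency check but drop the unconditional debug line:

	  assert (using_device_old == -1
		  /* We shouldn't have concurrently changed our mind.  */
		  || using_device_old == using_device);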


The second one essentially lists how many devices of a specific type exist. But it is not really relevant whether there are one or ten Nvidia GPUs, or one or eight AMD GPUs. Again, that output made sense while developing this function, but now that it has been implemented, I don't think there is much insight to be gained from it.
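If any output is kept there at all, a single summary line outside the loop would seem sufficient – something along these lines (sketch only; num_devices is the variable from the quoted loop):

	  gomp_debug (0, "  num_devices=%d\n", num_devices);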

Hence:

Can this be trimmed to something sensible for users?

* * *

I should be able to repost on Monday, or so.

Thanks,

Tobias
