Hi all,

Andrew Stubbs wrote:
On 17/10/2025 16:18, Tobias Burnus wrote:
Slightly offtopic remark:
I think it 'only' helps with performance when
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
is false - and can also be achieved by using cuMemHostRegister as
alternative. I wrote 'only' as the only two systems I know that have
this attribute set are a Grace-Hopper GH100 system and ORNL's former
Summit system.
I do not recall on which system we observed the speed-up (a couple of years ago), but it could not have been those systems as we did not have access to either at the time.  We did observe a real-world speed-up using one of the x86_64 test machines we had then (however, the PowerPC machine I tested did not demonstrate the advantage).

[Note that the PowerPC + Vega machine that we had didn't have this flag, unlike Summit.]

The speed-up makes sense, based on Nvidia's documentation for cuMemAllocHost/cuMemHostAlloc:

"The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cuMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc()."

Likewise for "normal" memory allocations followed by cuMemHostRegister: "This memory range also is added to the same tracking mechanism as cuMemHostAlloc to automatically accelerate calls to functions such as cuMemcpyHtoD(). […]"

I tried to find the quote again that links the latter to CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, but I failed.

I now assume that I misread it. Still, I assume that this feature helps here, as it avoids extra hops. Whether it helps enough to make calling cuMemHostRegister unnecessary, or whether registering still improves performance, is unclear – but it could be tested (using, e.g., a GH100 machine).
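For reference, such a test could start from something like the following – a quick sketch of mine, not part of the patch – which merely queries the attribute via the CUDA driver API:

#include <stdio.h>
#include <cuda.h>

int
main (void)
{
  CUdevice dev;
  int val = 0;
  /* Query whether the default device accesses pageable host memory via the
     host's page tables (reported true on, e.g., Grace-Hopper systems).  */
  if (cuInit (0) != CUDA_SUCCESS
      || cuDeviceGet (&dev, 0) != CUDA_SUCCESS
      || cuDeviceGetAttribute (&val,
			       CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
			       dev) != CUDA_SUCCESS)
    return 1;
  printf ("pageable memory access uses host page tables: %d\n", val);
  return 0;
}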

But independent of whether it is required for GH100 or not, it surely should help at least on all other systems that do not have this feature – and it avoids (for better or worse) the 'ulimit' for locked memory.
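To illustrate what is being compared – again just a sketch of mine with minimal error handling, assuming a CUDA context is already current – the two documented ways to obtain driver-tracked host memory are cuMemHostAlloc and a "normal" allocation followed by cuMemHostRegister:

#include <stdlib.h>
#include <cuda.h>

/* Return SIZE bytes of host memory that the CUDA driver tracks (and hence
   uses for accelerated copies), either allocated by the driver itself or
   obtained via malloc and then registered; NULL on failure.  Neither path
   is subject to the locked-memory 'ulimit' the way an mlock-based scheme
   would be.  */

static void *
get_tracked_host_mem (size_t size, int use_register)
{
  void *p;
  if (use_register)
    {
      p = malloc (size);
      if (p != NULL && cuMemHostRegister (p, size, 0) != CUDA_SUCCESS)
	{
	  free (p);
	  p = NULL;
	}
    }
  else if (cuMemHostAlloc (&p, size, 0) != CUDA_SUCCESS)
    p = NULL;
  return p;
}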

* * *

For completeness: The tests assume that either no Nvidia GPU
is available or the default device is an Nvidia GPU.
Otherwise, offload_device_nvptx will be false but
cuMemHostAlloc is still called.

However, that shouldn't be an issue most of the time as
libgomp first checks for CUDA devices - and the default
device becomes the first device found.

The 'checked first' part seems to hold as most GCC builds configure --enable-offload-targets=nvptx-none,amdgcn-amdhsa in this order (assuming both targets are enabled at all) – partly for historic reasons (nvptx support came first). However, the runtime check ('dlopen') is actually done in the order specified by this configure option.

But I have to correct my assertion that it is always in the order nvptx then gcn.
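To make the ordering concrete, here is a stand-alone sketch (not the actual libgomp code; the plugin names and the soname pattern are assumptions on my side) of trying the plugins in the configured order – the first one that loads and reports devices provides device 0, i.e. the default device:

#include <stdio.h>
#include <dlfcn.h>

int
main (void)
{
  /* Order as with --enable-offload-targets=nvptx-none,amdgcn-amdhsa.  */
  static const char *const plugins[] = { "nvptx", "gcn" };
  for (size_t i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
    {
      char soname[64];
      snprintf (soname, sizeof soname, "libgomp-plugin-%s.so.1", plugins[i]);
      void *h = dlopen (soname, RTLD_LAZY);
      printf ("%s: %s\n", soname, h != NULL ? "found" : "not found");
      if (h != NULL)
	dlclose (h);
    }
  return 0;
}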

* * *

In that case, I think we might have bigger problems trying to test on systems that have both flavours of GPU. I have not tested that configuration, but I have run the tests on both AMD and NVPTX independently.

I think the number of systems that have both an Nvidia and an AMD device, with both CUDA and ROCm's ROCr installed, is small. Additionally, such a system would need a default device other than the Nvidia one.

Only in that case will there be a failure. That seems to be a rather special case – and the only harm will be that those few pinned-memory testcases fail; the pinned-memory handling itself should work just fine.

* * *

OK, I will review realloc in both patches.

Thanks. (I admittedly forgot to recheck when looking at your "PATCH v6 2/2".)

* * *

Question: How many gomp_debug calls exist for a successful
linux_memspace_calloc that uses cuMemHostRegister?

Answer: 11 - if I have not miscounted.

[…]

The most absurd is the following one, which is always
shown:

+         gomp_debug (0, " using_device=%d, using_device_old=%d\n",
+                     using_device, using_device_old);
+         assert (using_device_old == -1
+                 /* We shouldn't have concurrently changed our mind.  */
+                 || using_device_old == using_device);

But I also find

+      for (int i = 0; i < num_devices; ++i)
+       {
+         gomp_debug (0, "  i=%d, target_id=%d\n",
+                     i, devices[i].target_id);

rather useless.

Thomas, do you have an opinion about these? I believe you added them?

The first one looks like a leftover from debugging. using_device_old is expected to be either '-1' (uninitialized) or, if some other thread ran concurrently, the same value (no device = 0/false, some device = 1/true).

Thus, the assert kind of makes sense, albeit it seems overly bulletproof – all reasonable cases where it could fail should already have failed earlier with a call to gomp_fatal.

But independent of the assert, I bet that the values of those variables can also be deduced from the remaining diagnostic output, unless that output is cut to zero.
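For illustration only – one way to trim it (a sketch, not meant as the final wording of the patch) would keep the consistency check but drop the unconditional debug line:

	  assert (using_device_old == -1
		  /* We shouldn't have concurrently changed our mind.  */
		  || using_device_old == using_device);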


The second one essentially lists how many devices of a specific type exist. But it is not really relevant whether there are one or ten Nvidia GPUs, or one or eight AMD GPUs. Again, that output made sense while developing this function, but now that it has been implemented, I don't think there is much insight to be gained from it.
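If any output is kept there at all, a single summary line outside the loop would seem sufficient – something along these lines (sketch only; num_devices is the variable from the quoted loop):

	  gomp_debug (0, "  num_devices=%d\n", num_devices);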

Hence:

Can this be trimmed to something sensible for users?

* * *

I should be able to repost on Monday, or so.

Thanks,

Tobias
