Hi all,
Andrew Stubbs wrote:
On 17/10/2025 16:18, Tobias Burnus wrote:
Slightly offtopic remark:
I think it 'only' helps with performance when
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
is false - and can also be achieved by using cuMemHostRegister as an
alternative. I wrote 'only' as the only two systems I know of that have
this attribute set are a Grace-Hopper GH100 system and ORNL's former
Summit system.
I do not recall on which system we observed the speed-up (a couple of
years ago), but it could not have been those systems as we did not
have access to either at the time. We did observe a real-world
speed-up using one of the x86_64 test machines we had then (however,
the PowerPC machine I tested did not demonstrate the advantage).
[Note that the PowerPC + Vega machine that we had didn't have this flag,
unlike Summit.]
The speed-up makes sense, based on Nvidia's documentation
for cuMemAllocHost/cuMemHostAlloc:
"The driver tracks the virtual memory ranges allocated with this
function and automatically accelerates calls to functions such as
cuMemcpy(). Since the memory can be accessed directly by the device, it
can be read or written with much higher bandwidth than pageable memory
obtained with functions such as malloc()."
Likewise for "normal" memory allocations, followed by
cuMemHostRegister: "This memory range also is added to the same tracking
mechanism as cuMemHostAlloc to automatically accelerate calls to
functions such as cuMemcpyHtoD(). […]"
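For illustration, a minimal sketch (plain CUDA driver API, not libgomp
code; error handling reduced to abort, link with -lcuda) of the two
variants those quotes describe:

#include <stdlib.h>
#include <unistd.h>
#include <cuda.h>

#define CHECK(x) do { if ((x) != CUDA_SUCCESS) abort (); } while (0)

int
main (void)
{
  CUdevice dev;
  CUcontext ctx;
  CHECK (cuInit (0));
  CHECK (cuDeviceGet (&dev, 0));
  CHECK (cuCtxCreate (&ctx, 0, dev));

  /* Variant 1: let the CUDA driver allocate page-locked host memory.  */
  void *p1;
  CHECK (cuMemHostAlloc (&p1, 1 << 20, 0));

  /* Variant 2: allocate "normally" (page-aligned, to be on the safe
     side) and register the range afterwards; it is then added to the
     same tracking mechanism.  */
  void *p2;
  if (posix_memalign (&p2, sysconf (_SC_PAGESIZE), 1 << 20) != 0)
    abort ();
  CHECK (cuMemHostRegister (p2, 1 << 20, 0));

  /* ... cuMemcpyHtoD/cuMemcpyDtoH on either range is accelerated ...  */

  CHECK (cuMemHostUnregister (p2));
  free (p2);
  CHECK (cuMemFreeHost (p1));
  CHECK (cuCtxDestroy (ctx));
  return 0;
}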
I tried to find the quote again that links the latter to
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, but I
failed.
I now assume that I misread it. Still, I assume that this feature helps
here – as it avoids extra hops. Whether it helps enough that calling
cuMemHostRegister becomes unnecessary, or whether registering still helps
with performance, is unclear, but that could be tested (using, e.g., a
GH100 machine).
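For instance, with a small check along these lines (untested sketch,
plain CUDA driver API):

#include <stdio.h>
#include <cuda.h>

int
main (void)
{
  CUdevice dev;
  int attr = 0;
  if (cuInit (0) != CUDA_SUCCESS || cuDeviceGet (&dev, 0) != CUDA_SUCCESS)
    return 1;
  /* 1 = pageable memory is accessed via the host's page tables
     (e.g. on the Grace-Hopper and Summit systems mentioned above).  */
  cuDeviceGetAttribute (&attr,
			CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
			dev);
  printf ("pageable access uses host page tables: %d\n", attr);
  return 0;
}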
But independent of whether it is required for GH100 or not, it surely
should help at least on all other systems that do not have this feature
– and it avoids (for better or worse) the 'ulimit' for locked memory.
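For reference, that limit is RLIMIT_MEMLOCK ('ulimit -l'); a tiny sketch
to print it:

#include <stdio.h>
#include <sys/resource.h>

int
main (void)
{
  struct rlimit rl;
  if (getrlimit (RLIMIT_MEMLOCK, &rl) == 0)
    /* The soft limit is what an mlock-based pinning implementation runs
       into; RLIM_INFINITY means unlimited.  */
    printf ("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
	    (unsigned long long) rl.rlim_cur,
	    (unsigned long long) rl.rlim_max);
  return 0;
}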
* * *
For completeness: The tests assume that either no Nvidia GPU
is available or the default device is an Nvidia GPU.
Otherwise, offload_device_nvptx will be false but
cuMemHostAlloc is still called.
However, that shouldn't be an issue most of the time as
libgomp first checks for CUDA devices - and the default
device becomes the first device found.
The 'checked first' part seems to be true, as most GCC builds use
--enable-offload-targets=nvptx-none,amdgcn-amdhsa (in this order,
assuming both are enabled) – also for historic reasons (nvptx was
available first). But, actually, the runtime check ('dlopen') is done in
the order specified by this configure option, so I have to correct my
assertion that it is always in the order nvptx then gcn.
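To illustrate the idea (a sketch only, not libgomp's actual code; the
plugin file names used here, libgomp-plugin-nvptx.so.1 and
libgomp-plugin-gcn.so.1, are assumptions based on the installed
libgomp-plugin-*.so files):

#include <dlfcn.h>
#include <stdio.h>

int
main (void)
{
  /* Order as in --enable-offload-targets=nvptx-none,amdgcn-amdhsa.  */
  static const char *const plugins[]
    = { "libgomp-plugin-nvptx.so.1", "libgomp-plugin-gcn.so.1" };

  for (int i = 0; i < 2; i++)
    {
      void *h = dlopen (plugins[i], RTLD_LAZY);
      printf ("%s: %s\n", plugins[i], h ? "found" : "not found");
      /* libgomp then asks each plugin for its devices; the first device
	 found becomes device 0 and thus the default device.  */
      if (h)
	dlclose (h);
    }
  return 0;
}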
* * *
In that case, I think we might have bigger problems trying to test on
systems that have both flavours of GPU. I have not tested that
configuration, but I have run the tests on both AMD and NVPTX
independently.
I think the number of systems that have both an Nvidia and an AMD device,
with both CUDA and ROCm's ROCr installed, is small. Additionally, they
would need to have a default device other than the Nvidia one. Only in
this case will there be a failure. That seems to be a rather special case
– and the only harm is that those few pinned-memory testcases fail; the
pinned-memory handling itself should work just fine.
* * *
OK, I will review realloc in both patches.
Thanks. (I admittedly forgot to recheck when looking at your "PATCH v6
2/2".)
* * *
Question: How many gomp_debug calls exist for a successful
linux_memspace_calloc that uses cuMemHostRegister?
Answer: 11 - if I have not miscounted.
[…]
The most absurd is the following one, which is always
shown:
+  gomp_debug (0, " using_device=%d, using_device_old=%d\n",
+              using_device, using_device_old);
+  assert (using_device_old == -1
+          /* We shouldn't have concurrently changed our mind.  */
+          || using_device_old == using_device);
But I also find
+  for (int i = 0; i < num_devices; ++i)
+    {
+      gomp_debug (0, " i=%d, target_id=%d\n",
+                  i, devices[i].target_id);
rather useless.
Thomas, do you have an opinion about these? I believe you added them?
The first one looks like a debugging item: using_device_old is expected
to be either '-1' (uninitialized) or, if some other thread ran
concurrently, the same value (no device = 0/false, some device = 1/true).
Thus, the assert kind of makes sense, albeit it seems overly bulletproof
– all reasonable cases where it could fail should already have failed
earlier with a call to gomp_fatal.
But independent of the assert, I bet that the value of those variables
can also be inferred from the remaining diagnostic output, unless that
output is trimmed down to nothing.
The second one essentially lists how many devices of a specific type
exist. But it is not really relevant whether there are one or ten Nvidia
GPUs, or one or eight AMD GPUs. Again, that output made sense while
developing this function, but now that it is implemented, I don't think
there is much insight to be gained from this debugging output.
Hence:
Can this be trimmed to something sensible for users?
* * *
I should be able to repost on Monday or so.
Thanks,
Tobias