> -----Original Message-----
> From: Thomas Schwinge <tho...@codesourcery.com>
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs <a...@codesourcery.com>; Jakub Jelinek <ja...@redhat.com>;
> Tobias Burnus <tob...@codesourcery.com>; gcc-patches@gcc.gnu.org
> Subject: Attempt to register OpenMP pinned memory using a device instead of
> 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
>
> Hi!
>
> On 2022-06-09T11:38:22+0200, I wrote:
> > On 2022-06-07T13:28:33+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> >> On 07/06/2022 13:10, Jakub Jelinek wrote:
> >>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
> >>>> Following some feedback from users of the OG11 branch I think I need to
> >>>> withdraw this patch, for now.
> >>>>
> >>>> The memory pinned via the mlock call does not give the expected
> >>>> performance boost.  I had not expected that it would do much in my test
> >>>> setup, given that the machine has a lot of RAM and my benchmarks are
> >>>> small, but others have tried more and on varying machines and
> >>>> architectures.
> >>>
> >>> I don't understand why there should be any expected performance boost (at
> >>> least not unless the machine starts swapping out pages),
> >>> { omp_atk_pinned, true } is solely about the requirement that the memory
> >>> can't be swapped out.
> >>
> >> It seems like it takes a faster path through the NVidia drivers.  This is
> >> a black box, for me, but that seems like a plausible explanation.  The
> >> results are different on x86_64 and powerpc hosts (such as the Summit
> >> supercomputer).
> >
> > For example, it's documented that 'cuMemHostAlloc',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> > "Allocates page-locked host memory".  The crucial thing, though, that
> > makes this different from 'malloc' plus 'mlock' is that "The driver
> > tracks the virtual memory ranges allocated with this function and
> > automatically accelerates calls to functions such as cuMemcpyHtoD().
> > Since the memory can be accessed directly by the device, it can be read
> > or written with much higher bandwidth than pageable memory obtained with
> > functions such as malloc()".
> >
> > Similar, for example, for 'cuMemAllocHost',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
> >
> > This, to me, would explain why "the mlock call does not give the expected
> > performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
> > with 'mlock' you're missing the "tracks the virtual memory ranges"
> > aspect.
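
(Purely for illustration, and not taken from any patch: a minimal C sketch of
the two host-side paths being compared above, with error reporting and CUDA
context setup omitted; cuMemHostAlloc requires a current CUDA context.)

  #include <stdlib.h>
  #include <sys/mman.h>
  #include <cuda.h>

  /* Path 1: page-locked via the OS only.  The CUDA driver does not know
     about this range, so cuMemcpyHtoD() etc. still treat it as pageable.  */
  static void *
  alloc_pinned_mlock (size_t size)
  {
    void *p = malloc (size);
    if (p && mlock (p, size) != 0)
      {
        free (p);
        return NULL;
      }
    return p;
  }

  /* Path 2: allocated by the CUDA driver itself.  The range is page-locked
     *and* tracked by the driver, so host<->device copies can take the
     accelerated path described above.  */
  static void *
  alloc_pinned_cuda (size_t size)
  {
    void *p = NULL;
    if (cuMemHostAlloc (&p, size, 0) != CUDA_SUCCESS)
      return NULL;
    return p;
  }
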
> >
> > Also, by means of the Nvidia Driver allocating the memory, I suppose
> > using this interface likely circumvents any "annoying" 'ulimit'
> > limitations?  I get this impression, because documentation continues
> > stating that "Allocating excessive amounts of memory with
> > cuMemAllocHost() may degrade system performance, since it reduces the
> > amount of memory available to the system for paging.  As a result, this
> > function is best used sparingly to allocate staging areas for data
> > exchange between host and device".
> >
> >>>> It seems that it isn't enough for the memory to be pinned, it has to be
> >>>> pinned using the Cuda API to get the performance boost.
> >>>
> >>> For performance boost of what kind of code?
> >>> I don't understand how Cuda API could be useful (or can be used at all) if
> >>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
> >>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
> >>> in any way related to NVPTX offloading (unless it is in NVPTX target region
> >>> obviously, but then mlock isn't available, so sure, if there is something
> >>> CUDA can provide for that case, nice).
> >>
> >> This is specifically for NVPTX offload, of course, but then that's what
> >> our customer is paying for.
> >>
> >> The expectation, from users, is that memory pinning will give the
> >> benefits specific to the active device.  We can certainly make that
> >> happen when there is only one (flavour of) offload device present.  I had
> >> hoped it could be one way for all, but it looks like not.
> >
> > Aren't there CUDA Driver interfaces for that?  That is:
> >
> >>>> I had not done this because it was difficult to resolve the code
> >>>> abstraction difficulties and anyway the implementation was supposed to
> >>>> be device independent, but it seems we need a specific pinning
> >>>> mechanism for each device.
> >
> > If not directly *allocating and registering* such memory via
> > 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> > *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> > "Page-locks the memory range specified [...] and maps it for the
> > device(s) [...].  This memory range also is added to the same tracking
> > mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> > manual 'mlock'ing involved in that case either; presumably again using this
> > interface likely circumvents any "annoying" 'ulimit' limitations?)
> >
> > Such a *register* abstraction can then be implemented by all the libgomp
> > offloading plugins: they just call the respective
> > CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> > memory.
> >
> > ..., but maybe I'm missing some crucial "detail" here?
>
> Indeed this does appear to work; see attached
> "[WIP] Attempt to register OpenMP pinned memory using a device instead of
> 'mlock'".
> Any comments (aside from the TODOs that I'm still working on)?
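
(Again only a sketch, not the attached WIP patch: the "register only"
approach described above could look roughly like this for the nvptx case,
assuming a current CUDA context and ignoring error reporting.)

  #include <stdlib.h>
  #include <cuda.h>

  /* Allocate with the ordinary host allocator, then page-lock and register
     the range with the CUDA driver so it joins the same tracking mechanism
     as memory obtained from cuMemHostAlloc.  */
  static void *
  alloc_and_register (size_t size)
  {
    void *p = malloc (size);
    if (p && cuMemHostRegister (p, size, 0) != CUDA_SUCCESS)
      {
        free (p);
        return NULL;
      }
    return p;
  }

  /* Undo the registration before handing the memory back to the host
     allocator.  */
  static void
  unregister_and_free (void *p)
  {
    if (p)
      {
        cuMemHostUnregister (p);
        free (p);
      }
  }
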
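(For concreteness, the usage pattern at issue below is ordinary OpenMP
allocator code along these lines, i.e. many small requests through a pinned
allocator; purely illustrative, not taken from any benchmark.)

  #include <omp.h>

  void
  pinned_allocations (void)
  {
    /* Request pinned memory via the OpenMP allocator traits.  */
    omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
    omp_allocator_handle_t al
      = omp_init_allocator (omp_default_mem_space, 1, traits);

    void *bufs[1024];
    for (int i = 0; i < 1024; i++)
      bufs[i] = omp_alloc (64, al);  /* Lots of small pinned blocks.  */

    /* ... use the buffers, e.g. mapped into target regions ...  */

    for (int i = 0; i < 1024; i++)
      omp_free (bufs[i], al);
    omp_destroy_allocator (al);
  }
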
The mmap implementation was not optimized for a lot of small allocations, and
I can't see that issue changing here, so I don't know whether this can be
used as an mlockall replacement.  I had assumed that using the CUDA allocator
would fix that limitation.

Andrew