> -----Original Message-----
> From: Thomas Schwinge <tho...@codesourcery.com>
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs <a...@codesourcery.com>; Jakub Jelinek <ja...@redhat.com>;
> Tobias Burnus <tob...@codesourcery.com>; gcc-patches@gcc.gnu.org
> Subject: Attempt to register OpenMP pinned memory using a device instead of
> 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
> 
> Hi!
> 
> On 2022-06-09T11:38:22+0200, I wrote:
> > On 2022-06-07T13:28:33+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> >> On 07/06/2022 13:10, Jakub Jelinek wrote:
> >>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
> >>>> Following some feedback from users of the OG11 branch I think I need to
> >>>> withdraw this patch, for now.
> >>>>
> >>>> The memory pinned via the mlock call does not give the expected
> >>>> performance boost. I had not expected that it would do much in my test
> >>>> setup, given that the machine has a lot of RAM and my benchmarks are
> >>>> small, but others have tried more and on varying machines and
> >>>> architectures.
> >>>
> >>> I don't understand why there should be any expected performance boost
> >>> (at least not unless the machine starts swapping out pages),
> >>> { omp_atk_pinned, true } is solely about the requirement that the memory
> >>> can't be swapped out.
> >>
> >> It seems like it takes a faster path through the NVidia drivers. This is
> >> a black box, for me, but that seems like a plausible explanation. The
> >> results are different on x86_64 and powerpc hosts (such as the Summit
> >> supercomputer).
> >
> > For example, it's documented that 'cuMemHostAlloc',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> > "Allocates page-locked host memory".  The crucial thing, though, what
> > makes this different from 'malloc' plus 'mlock' is, that "The driver
> > tracks the virtual memory ranges allocated with this function and
> > automatically accelerates calls to functions such as cuMemcpyHtoD().
> > Since the memory can be accessed directly by the device, it can be read
> > or written with much higher bandwidth than pageable memory obtained with
> > functions such as malloc()".
> >
> > Similar, for example, for 'cuMemAllocHost',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
> >
> > This, to me, would explain why "the mlock call does not give the expected
> > performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
> > with 'mlock' you're missing the "tracks the virtual memory ranges"
> > aspect.
> >
> > Also, by means of the Nvidia Driver allocating the memory, I suppose
> > using this interface likely circumvents any "annoying" 'ulimit'
> > limitations?  I get this impression, because documentation continues
> > stating that "Allocating excessive amounts of memory with
> > cuMemAllocHost() may degrade system performance, since it reduces the
> > amount of memory available to the system for paging.  As a result, this
> > function is best used sparingly to allocate staging areas for data
> > exchange between host and device".
> >
> >>>> It seems that it isn't enough for the memory to be pinned; it has to be
> >>>> pinned using the Cuda API to get the performance boost.
> >>>
> >>> For performance boost of what kind of code?
> >>> I don't understand how Cuda API could be useful (or can be used at all)
> >>> if offloading to NVPTX isn't involved.  The fact that somebody asks for
> >>> host memory allocation with omp_atk_pinned set to true doesn't mean it
> >>> will be in any way related to NVPTX offloading (unless it is in NVPTX
> >>> target region obviously, but then mlock isn't available, so sure, if
> >>> there is something CUDA can provide for that case, nice).
> >>
> >> This is specifically for NVPTX offload, of course, but then that's what
> >> our customer is paying for.
> >>
> >> The expectation, from users, is that memory pinning will give the
> >> benefits specific to the active device. We can certainly make that
> >> happen when there is only one (flavour of) offload device present. I had
> >> hoped it could be one way for all, but it looks like not.
> >
> > Aren't there CUDA Driver interfaces for that?  That is:
> >
> >>>> I had not done this because it was difficult to resolve the code
> >>>> abstraction difficulties and anyway the implementation was supposed to
> >>>> be device independent, but it seems we need a specific pinning
> >>>> mechanism for each device.
> >
> > If not directly *allocating and registering* such memory via
> > 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> > *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> > "Page-locks the memory range specified [...] and maps it for the
> > device(s) [...].  This memory range also is added to the same tracking
> > mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> > manual 'mlock'ing involved in that case either; presumably using this
> > interface again circumvents any "annoying" 'ulimit' limitations?)
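
[A minimal, untested sketch of the register-only path just quoted:
pin an already-malloc'ed buffer via 'cuMemHostRegister', assuming cuInit
has been called and a CUDA context is current; the helper names here are
made up for the example.]

#include <cuda.h>
#include <stdlib.h>

/* Page-lock an existing host buffer so the driver adds it to the same
   tracking mechanism as cuMemHostAlloc.  */
static void *
pinned_alloc (size_t size)
{
  void *ptr = malloc (size);
  if (ptr == NULL)
    return NULL;
  /* Register (page-lock) the malloc'ed range with the CUDA driver.  */
  if (cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
      != CUDA_SUCCESS)
    {
      free (ptr);
      return NULL;
    }
  return ptr;
}

static void
pinned_free (void *ptr)
{
  if (ptr != NULL)
    {
      /* Undo the page-locking before returning the memory to malloc.  */
      cuMemHostUnregister (ptr);
      free (ptr);
    }
}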
> >
> > Such a *register* abstraction can then be implemented by all the libgomp
> > offloading plugins: they just call the respective
> > CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> > memory.
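
[Sketch only, with invented hook names rather than whatever the attached
WIP patch actually uses: on the nvptx side such a register abstraction
could boil down to a register/unregister pair of plugin entry points
wrapping the calls above.]

#include <cuda.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical plugin entry points (names made up for illustration):
   libgomp would call these to page-lock / release host memory that it
   allocated itself, e.g. with malloc.  */

bool
GOMP_OFFLOAD_register_pinned (void *ptr, size_t size)
{
  return (cuMemHostRegister (ptr, size, CU_MEMHOSTREGISTER_PORTABLE)
          == CUDA_SUCCESS);
}

bool
GOMP_OFFLOAD_unregister_pinned (void *ptr)
{
  return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
}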
> >
> > ..., but maybe I'm missing some crucial "detail" here?
> 
> Indeed this does appear to work; see attached
> "[WIP] Attempt to register OpenMP pinned memory using a device instead of
> 'mlock'".
> Any comments (aside from the TODOs that I'm still working on)?

The mmap implementation was not optimized for a lot of small allocations, and I
can't see that issue changing here, so I don't know whether this can be used as
an mlockall replacement.

I had assumed that using the Cuda allocator would fix that limitation.

Andrew
