On 2025-11-25 at 02:28 +1100, Gregory Price <[email protected]> wrote...
> On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> > On 2025-11-22 at 08:07 +1100, Gregory Price <[email protected]> wrote...
> > > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > > >
> >
> > There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> > is mostly irrelevant for this discussion but I'm including the descriptions here
> > for completeness.
> >
>
> I appreciate you taking the time here. I'll maybe try to look at
> updating the docs as this evolves.
I believe the DEVICE_PRIVATE bit is documented here
https://www.kernel.org/doc/Documentation/vm/hmm.rst , but if there is anything
there that you think needs improvement I'd be happy to look or review. I'm not
sure if that was updated for DEVICE_COHERENT though.

> > > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> > >
> > > fd = open("/dev/gpu_mem", ...)
> > > buf = mmap(fd, ...)
> > > buf[0]
> > > 1) driver takes the fault
> > > 2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> > > 3) driver manages any special page table masks
> > >    Like marking pages RO/RW to manage ownership.
> >
> > Of course as an aside this needs to match the CPU PTEs logic (this is what
> > hmm_range_fault() is primarily used for).
> >
>
> This is actually the most interesting part of the series for me. I'm using
> a compressed memory device as a stand-in for a memory type that requires
> special page table entries (RO) to avoid compression ratios tanking
> (resulting, eventually, in an MCE as there's no way to slow things down).
>
> You can somewhat "get there from here" through device coherent
> ZONE_DEVICE, but you still don't have access to basic services like
> compaction and reclaim - which you absolutely do want for such a memory
> type (for the same reasons we groom zswap and zram).
>
> I wonder if we can even re-use the hmm interfaces for SPM nodes to make
> managing special page table policies easier as well. That seems
> promising.

It might depend on what exactly you're looking to do - HMM is really two parts,
one for mirroring page tables and another for allowing special non-present PTEs
to be set up to map a dummy ZONE_DEVICE struct page that notifies a driver when
the CPU attempts access.

> I said this during LSFMM: Without isolation, "memory policy" is really
> just a suggestion. What we're describing here is all predicated on
> isolation work, and all of a sudden much clearer examples of managing
> memory on NUMA boundaries start to make a little more sense.

I very much agree with the views of memory policy that you shared in one of the
other threads. I don't think it is adequate for providing isolation, and agree
the isolation (and degree of isolation) is the interesting bit of the work here,
at least for now.

> > > 4) driver sends the gpu the (mapping_id, pfn, index) information
> > >    so that the gpu can map the region in its page tables.
> > On coherent systems this often just uses HW address translation services
> > (ATS), although I think the specific implementation of how page-tables are
> > mirrored/shared is orthogonal to this.
> >
> Yeah this part is completely foreign to me, I just presume there's some
> way to tell the GPU how to reconstruct the virtually contiguous setup.
> That mechanism would be entirely reusable here (I assume).
> > This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> > except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> > mapped as a non-present special swap entry that triggers a driver callback due
> > to the lack of cache coherence.
> >
> Btw, just an aside, Lorenzo is moving to rename these entries to
> softleaf (software-leaf) entries. I think you'll find it welcome.
> https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoa...@oracle.com/
>
> > > Driver doesn't have to do much in the way of allocation management.
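
Just to check my understanding of the open()/mmap()/fault pattern you sketched
above: in driver terms I imagine steps 1-3 would look roughly like the sketch
below. Treat it purely as a sketch - gpu_node and __GFP_SPM_NODE are
placeholders from this proposal rather than symbols that exist today, and the
RO/RW ownership policy would hang off the prot used when installing the PTE.

#include <linux/mm.h>
#include <linux/gfp.h>

static int gpu_node;	/* SPM node id discovered by the driver */

/* 1) driver's vm_ops->fault handler; the mmap() handler is assumed to
 *    have set VM_MIXEDMAP on the vma so vmf_insert_page() is usable. */
static vm_fault_t gpu_mem_fault(struct vm_fault *vmf)
{
	struct page *page;

	/* 2) allocate the backing page from the SPM node only */
	page = alloc_pages_node(gpu_node, GFP_HIGHUSER | __GFP_SPM_NODE, 0);
	if (!page)
		return VM_FAULT_OOM;

	/*
	 * 3) install the PTE; a driver wanting RO-first ownership tracking
	 * could use vmf_insert_pfn_prot() with a read-only prot instead.
	 */
	return vmf_insert_page(vmf->vma, vmf->address, page);
}

For your compressed memory case the interesting bit would presumably be that
last step, since that is where the driver gets to keep pages read-only until it
has done whatever bookkeeping it needs.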
> > >
> > > This is probably less compelling since you don't want general purpose
> > > services like reclaim, migration, compaction, tiering - etc.
> >
> > On at least some of our systems I'm told we do want this, hence my interest
> > here. Currently we have systems not using DEVICE_COHERENT and instead just
> > onlining everything as normal system managed memory in order to get reclaim
> > and tiering. Of course then people complain that it's managed as normal system
> > memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> > special purpose memory.
> >
> Ok, so now this gets interesting then. I don't understand how this
> makes sense (not saying it doesn't, I simply don't understand).
>
> I would presume that under no circumstance do you want device memory to
> just suddenly disappear without some coordination from the driver.
>
> Whether it's compaction or reclaim, you have some thread that's going to
> migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
> even map to the same memory device.
>
> That thread may not even be called in the context of a thread which
> accesses GPU memory (although, I think we could enforce that on top
> of SPM nodes, but devil is in the details).
>
> Maybe that "all magically works" because of the ATS described above?

Pretty much - both ATS and hmm_range_fault() are, conceptually at least, just
methods of sharing/mirroring the CPU page table to a device. So in your example
above, if a thread were to migrate a mapping from one page to another this
"black magic" would keep everything in sync.

Eg. For hmm_range_fault() the driver gets a mmu_notifier callback saying the
virtual mapping no longer points to HPA(A). If it needs to find the new mapping
to HPA(B) it can look it up using hmm_range_fault() and program its page tables
with the new mapping (a rough sketch of that loop is further down). At a
sufficiently high level ATS is just a HW-implemented equivalent of this.

> I suppose this assumes you have some kind of unified memory view between
> host and device memory? Are there docs here you can point me at that
> might explain this wizardry? (Sincerely, this is fascinating)

Right - it's all predicated on the host and device sharing the same view of the
virtual address space. I'm not sure of any good docs on this, but I will be at
LPC so would be happy to have a discussion there.

> > > The value is clearly that you get to manage GPU memory like any other
> > > memory, but without worry that other parts of the system will touch it.
> > >
> > > I'm much more focused on the "I have memory that is otherwise general
> > > purpose, and wants services like reclaim and compaction, but I want
> > > strong controls over how things can land there in the first place".
> >
> > So maybe there is some overlap here - what I have is memory that we want managed
> > much like normal memory but with strong controls over what it can be used for
> > (ie. just for tasks utilising the processing element on the accelerator).
> >
> I think it might be great if we could discuss this a bit more in-depth,
> as I've already been considering very mild refactors to reclaim to
> enable a driver to engage it with an SPM node as the only shrink target.

Absolutely! Looking forward to an in-person discussion.

- Alistair

> This all becomes much more complicated due to per-memcg LRUs and such.
>
> All that said, I'm focused on the isolation / allocation pieces first.
> If that can't be agreed upon, the rest isn't worth exploring.
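
As promised above, here is a rough sketch of the usual hmm_range_fault()
mirroring loop. It is essentially the example from Documentation/mm/hmm.rst
with the driver-side lock that protects the device page tables elided, so
please read it as a sketch rather than anything SPM-specific.

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>

#define NPAGES 16	/* assumes the caller mirrors at most NPAGES pages */

static int mirror_range(struct mmu_interval_notifier *notifier,
			struct mm_struct *mm,
			unsigned long start, unsigned long end)
{
	unsigned long pfns[NPAGES];
	struct hmm_range range = {
		.notifier	= notifier,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	/* the driver lock serialising device page table updates goes here */
	if (mmu_interval_read_retry(notifier, range.notifier_seq))
		goto again;	/* CPU mapping changed underneath us, refault */

	/* ... program the device page tables from pfns[] ... */
	return 0;
}

The mmu_interval_read_begin()/mmu_interval_read_retry() pair is what keeps the
device view consistent when something like compaction or reclaim moves the
page underneath the driver.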
>
> I do have a mild extension to mempolicy that allows mbind() to hit an
> SPM node as an example as well. I'll discuss this in the response to
> David's thread, as he had some related questions about the GFP flag.
>
> ~Gregory
>
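
For reference on the mbind() side, I assume the userspace shape stays as it is
today - something like the sketch below, where spm_node is a placeholder node
id and whether MPOL_BIND is allowed to name an SPM node at all is exactly the
policy/GFP question being discussed (today this would just bind to an ordinary
node).

#include <numaif.h>		/* mbind(); link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	int spm_node = 2;			/* placeholder node id */
	unsigned long nodemask = 1UL << spm_node;
	size_t len = 2UL << 20;

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* restrict this range to the chosen node before first touch */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	((char *)buf)[0] = 1;	/* fault a page in on the bound node */
	return 0;
}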

