On 2024-04-16 10:04, Tvrtko Ursulin wrote:
>
> On 01/04/2024 18:58, Felix Kuehling wrote:
>>
>> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>>
>>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>>>>
>>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Felix,
>>>>>>>
>>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node.
>>>>>>>
>>>>>>> It would probably be good to consider designing things without that dependency, so that checkpointing an application which does not use /dev/kfd is possible, or so that it works if the kernel does not even have the KFD support compiled in.
>>>>>>
>>>>>> Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information.
>>>>>>
>>>>>>> It could perhaps mean no more than adding some GPU discovery code into CRIU, which should be flexible enough to account for things like re-assigned minor numbers due to driver reload.
>>>>>>
>>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin? I was thinking this is still part of the plugin.
>>>>>
>>>>> Yes, I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from the reliance on kfd sysfs), and also so that, if/when a new DRM driver wants to implement this, the code could be moved to some common plugin area.
>>>>>
>>>>> I am not sure how feasible that would be though. The "gpu id" concept and its matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance, or how does it work?
>>>>
>>>> The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change.
>>>
>>> Doesn't it read the saved gpu id from the image file while doing restore and try to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system?
>>
>> Ah, the idea is that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on.
>
> I see.. I think. The normal flow is that ppd->user_gpu_id is set during client init, but for restored clients it gets overridden during restore, so that any further ioctls don't instantly fail.
>
> And then in amdgpu_plugin_restore_file, when it is opening the render node, it relies on the kfd topology to have filled in (more or less) the target_gpu_id corresponding to the render node gpu id of the target GPU - the one associated with the new kfd gpu_id?

Yes.

> I am digging into this because I am trying to see if some part of GPU discovery could somehow be decoupled.. so I can offer to work on at least that part until you get to the main body of the feature. But it looks properly tangled up.

OK. Most of the interesting plugin code should be in amdgpu_plugin_topology.c. It currently has some pretty complicated logic to find a set of devices that matches the topology in the checkpoint, including shader ISA versions, numbers of compute units, memory sizes, firmware versions and IO-Links between GPUs. This was originally done to support P2P with XGMI links. I'm not sure we ever updated it to properly support PCIe P2P.

> Do you have any suggestions for what I could help with? Maybe developing some sort of drm device enumeration library, if you see a way that would be useful in decoupling the device discovery from kfd. We would need to define what sort of information you would need to be queryable from it.

Maybe. I think a lot of device information is available with the amdgpu INFO ioctl. It may not cover all the things we're checking in the KFD topology, though.
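To make that a bit more concrete, a rough sketch of what KFD-independent discovery could look like with plain libdrm (untested and written from memory, so treat the exact amdgpu_gpu_info fields as assumptions to double-check):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <xf86drm.h>
    #include <amdgpu.h>

    /* Walk all render nodes and query basic device info without
     * touching /dev/kfd or the KFD sysfs topology. */
    static void enumerate_render_nodes(void)
    {
        drmDevicePtr devices[64];
        int i, count = drmGetDevices2(0, devices, 64);

        if (count < 0)
            return;

        for (i = 0; i < count; i++) {
            amdgpu_device_handle dev;
            struct amdgpu_gpu_info info;
            uint32_t major, minor;
            int fd;

            if (!(devices[i]->available_nodes & (1 << DRM_NODE_RENDER)))
                continue;

            fd = open(devices[i]->nodes[DRM_NODE_RENDER], O_RDWR);
            if (fd < 0)
                continue;

            /* Fails for non-amdgpu nodes, which doubles as an
             * "is this device ours?" check for the plugin. */
            if (amdgpu_device_initialize(fd, &major, &minor, &dev) == 0) {
                if (amdgpu_query_gpu_info(dev, &info) == 0)
                    printf("%s: family %u, %u active CUs\n",
                           devices[i]->nodes[DRM_NODE_RENDER],
                           info.family_id, info.cu_active_number);
                amdgpu_device_deinitialize(dev);
            }
            close(fd);
        }
        drmFreeDevices(devices, count);
    }

Matching minor numbers would then be done against what the checkpoint recorded, rather than against anything in the kfd sysfs tree.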
>
>>>> This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID.
>>>
>>> I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more?
>>
>> Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.
>
> From the outside (userspace) this simply looks like multiple DRM render nodes, or how exactly?

Yes, that's correct. Each partition has its own render node and its own node in the KFD topology.

Regards,
  Felix

> Regards,
>
> Tvrtko
>
>> Regards,
>>   Felix
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>>>> Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also to seeing how to extend this to other DRM-related anonymous fds.
>>>>>>
>>>>>> I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks.
>>>>>
>>>>> Got it.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Regards,
>>>>>>   Felix
>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tvrtko
>>>>>>>
>>>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>>>>
>>>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Felix,
>>>>>>>>>>>
>>>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm applications once they start relying on render nodes for more GPU memory management. In this email I'm providing some background on why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic.
>>>>>>>>>>>>
>>>>>>>>>>>> We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm.
>>>>>>>>>>>>
>>>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes.
>>>>>>>>>>>>
>>>>>>>>>>>> Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific.
>>>>>>>>>>>>
>>>>>>>>>>>> After some internal discussions we decided to take our design process public, as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers?
>>>>>>>>>>>
>>>>>>>>>>> This sounds like a very interesting feature overall, although I cannot answer the last question here.
>>>>>>>>>>
>>>>>>>>>> I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think the feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers got interested.
>>>>>>>>>
>>>>>>>>> Thanks, that's good to hear!
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkpoint and restore, so I am mentioning it only as a curiosity.
>>>>>>>>>>>
>>>>>>>>>>> And there is also another partial conceptual intersection with the (at the moment not yet upstream) i915 online debugger, this part being in the area of discovering and enumerating GPU resources belonging to the client.
>>>>>>>>>>>
>>>>>>>>>>> I don't see immediate design or code sharing opportunities though, but just mentioning it.
>>>>>>>>>>>
>>>>>>>>>>> I did spend some time reading your plugin and kernel implementation out of curiosity and have some comments and questions.
>>>>>>>>>>>
>>>>>>>>>>>> With that out of the way, some considerations for a possible DRM CRIU API (either generic or AMDGPU driver-specific): The API goes through several phases during checkpoint and restore:
>>>>>>>>>>>>
>>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU)
>>>>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>>>>
>>>>>>>>>>>> Restore:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place at this time)
>>>>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume execution)
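For reference, the existing KFD implementation multiplexes all five of these phases through a single ioctl, roughly like this (abbreviated from include/uapi/linux/kfd_ioctl.h, from memory, so check the header for the authoritative layout):

    enum kfd_criu_op {
        KFD_CRIU_OP_PROCESS_INFO,  /* sizes/counts so user mode can allocate */
        KFD_CRIU_OP_CHECKPOINT,    /* store object metadata */
        KFD_CRIU_OP_UNPAUSE,       /* resume execution after the checkpoint */
        KFD_CRIU_OP_RESTORE,       /* recreate objects */
        KFD_CRIU_OP_RESUME,        /* final fixups, resume execution */
    };

    struct kfd_ioctl_criu_args {
        __u64 devices;        /* user pointer to device buckets */
        __u64 bos;            /* user pointer to BO buckets */
        __u64 priv_data;      /* user pointer to the opaque blob */
        __u64 priv_data_size;
        __u32 num_devices;
        __u32 num_bos;
        __u32 num_objects;
        __u32 pid;
        __u32 op;             /* enum kfd_criu_op */
    };

A generic DRM version could keep the same phase structure, whether as one multiplexed ioctl like this or as separate ioctls per phase.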
>>>>>>>>>>> Btw, does checkpointing guarantee that all relevant activity is idled? For instance, that dma_resv objects are free of fences which would need to be restored for things to continue executing sensibly? Or how is that handled?
>>>>>>>>>
>>>>>>>>> In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", which are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint.
>>>>>>>>>
>>>>>>>>> If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait for all currently pending submissions to drain, I think we don't need to save any fence state because all the fences will have signaled. (I may be missing some intricacies and I'm afraid it may not be that simple in reality, but that's my opening bid. ;)
>>>>>>>>
>>>>>>>> It feels feasible to me too, for the normally behaving clients at least.
>>>>>>>>
>>>>>>>> Presumably, given that the whole checkpointing is not instant, it would be okay to wait a second or two longer for the in-progress submissions to complete. After which the kernel would need to prune all signalled fences from the respective container objects before checkpointing.
>>>>>>>>
>>>>>>>> For the "misbehaving" clients who have perhaps queued up too much work, either still in the scheduler with unsatisfied dependencies, or already submitted to the hardware and/or driver backend, is there a timeout concept in CRIU so it would be possible to say something like "try to checkpoint, but if the kernel says no within time period t then give up"?
>>>>>>>>
>>>>>>>>>>>> For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>>>>
>>>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes; I'll address each of them in more detail below:
>>>>>>>>>>>>
>>>>>>>>>>>> * Opaque information in the checkpoint data that user mode can't interpret or do anything with
>>>>>>>>>>>> * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API
>>>>>>>>>>>> * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode
>>>>>>>>>>>>
>>>>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>>>>
>>>>>>>>>>>> This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI.
>>>>>>>>>>>>
>>>>>>>>>>>> Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data.
>>>>>>>>>>>
>>>>>>>>>>> This feels sound in principle to me. Fundamentally the state is very hardware-specific, and/or driver version specific, so I don't see anything that could be gained in practice by making it much less opaque. (Apart from making things more complicated.)
>>>>>>>>>>>
>>>>>>>>>>> I was however unsure of the current split of how you dump buffer objects, with some data in the defined bo structure and some in completely opaque private data. Is there a benefit to that split, or maybe in other words, is there a benefit to having part transparent and part opaque for buffer objects?
>>>>>>>>>
>>>>>>>>> Some of the buffer object state is needed by the plugin. E.g. the size and mmap offset are needed to match VMAs with BOs. I'd have to review the plugin in detail to prove that all the fields are, in fact, needed by the plugin, but that was the intent. Anything that the plugin doesn't need to know should be in the opaque data structures.
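For reference, the transparent per-BO part is quite small; abbreviated from include/uapi/linux/kfd_ioctl.h (again from memory, so treat the field details as approximate):

    struct kfd_criu_bo_bucket {
        __u64 addr;            /* CPU VA, used to match VMAs to BOs */
        __u64 size;            /* likewise needed by the plugin */
        __u64 offset;          /* mmap offset at checkpoint time */
        __u64 restored_offset; /* filled in by the kernel on restore */
        __u32 gpu_id;
        __u32 alloc_flags;
        __u32 dmabuf_fd;
        __u32 pad;
    };

Everything else about a BO travels in the opaque priv_data blob.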
>>>>>>>>
>>>>>>>> Right, got it.
>>>>>>>>
>>>>>>>> Would it make sense to put the opaque data in the same block as the defined one? I mean, instead of separating the two in the binary image, for instance have struct kfd_criu_bo_bucket carry a trailing priv_data blob? Maybe it is too late now if the image format is not versioned or something.
>>>>>>>>>>>
>>>>>>>>>>> To slightly touch upon the question of whether this could become a generic DRM API: it feels it would be hard to do from the start. What sounds more feasible is, if/when generic-looking helpers can be spotted while developing the RFC, then potentially structure the code so they can easily be promoted to shared/common at some future moment.
>>>>>>>>>
>>>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for confirming.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> # Second API for creating objects
>>>>>>>>>>>>
>>>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinned BOs as pinned. Validation of virtual addresses and handling MMU notifiers must be suspended until the virtual address space is restored. For user mode queues we need to save and restore a lot of queue execution state so that execution can resume cleanly.
>>>>>>>>>>>
>>>>>>>>>>> This also sounds justified to me. Restoring/creating all the internal objects is definitely special, and it sounds better to add uapi to create them directly with the correct properties than to add uapi to adjust internal properties after creation. And in any case you would need some new uapi - at least to adjust after creation - at which point you may as well have both in one. Internally the implementation can be split or common, whatever makes sense for a given object type, but new uapi definitely sounds required.
>>>>>>>>>>>
>>>>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>>>>
>>>>>>>>>>>> Different GEM handles in different render nodes and processes can refer to the same underlying shared memory, either by directly pointing to the same GEM object, or by creating an import attachment that may get its SG tables invalidated and updated dynamically through dynamic attachment callbacks. In the latter case it's obvious who is the exporter and who is the importer. In the first case, either one could be the exporter, and it's not clear who would need to create the BO and who would need to
>>>>>>>>>>>
>>>>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>>>>
>>>>>>>>>>> This could be two clients A and B, where B has imported a dma-buf shared BO from A and has since closed the dma-buf fd?
>>>>>>>>>>> Which results in a single BO with a reference count of 2 and obj->import_attach unset. History of who created the object is lost.
>>>>>>>>>
>>>>>>>>> Yes. In the amdgpu driver this happens when the exporting and importing device are the same.
>>>>>>>>>
>>>>>>>>>>> In fact it could even be that two imported objects remain (clients A, B and C) and A, who originally created the BO, has since fully released it. So any kind of "creator" tracking, if added, wouldn't be fully reliable either.
>>>>>>>>>
>>>>>>>>> That's a good point.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> import it when restoring the checkpoint. To further complicate things, multiple processes in a checkpoint get restored concurrently. So there is no guarantee that an exporter has restored a shared BO at the time an importer is trying to restore its import.
>>>>>>>>>>>>
>>>>>>>>>>>> A proposal to deal with these problems would be to treat importers and exporters the same. Whoever restores first ends up creating the BO and potentially attaching to it. The other process(es) can find BOs that were already restored by another process by looking them up with a unique ID that could be based on the DMABuf inode number. An alternative would be a two-pass approach that needs to restore BOs in two passes:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Restore exported BOs
>>>>>>>>>>>> 2. Restore imports
>>>>>>>>>>>>
>>>>>>>>>>>> With some inter-process synchronization in CRIU itself between these two passes. This may require changes in the core CRIU, outside our plugin. Both approaches depend on identifying BOs with some unique ID that could be based on the DMABuf inode number in the checkpoint. However, we would need to identify the processes in the same restore session, possibly based on parent/child process relationships, to create a scope where those IDs are valid during restore.
>>>>>>>>>>>
>>>>>>>>>>> If my understanding above is on the right track, then I think this is the only thing which can be done (for all scenarios).
>>>>>>>>>
>>>>>>>>> I presented two alternatives. I think you're in favor of the first one, where it doesn't matter who is the importer and exporter. I think the two-pass approach requires that you can identify an exporter. And as you pointed out, the exporter may already have dropped their reference to the BO.
>>>>>>>>
>>>>>>>> Yep.
>>>>>>>>
>>>>>>>>>>> I also *think* it would be safe. At least at the moment I cannot think what could go wrong. Semantics are that it doesn't really matter who created the object.
>>>>>>>>>
>>>>>>>>> I would agree. What matters is that the object is recreated on the correct device, and that all the direct references and import attachments pointing to it are restored.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf file descriptors themselves. These are anonymous file descriptors. The CRIU plugin could probably be taught to recreate them from the original exported BO based on the inode number that could be queried with fstat in the checkpoint. It would need help from the render node CRIU API to find the right BO from the inode, which may be from a different process in the same restore session.
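To make the checkpoint side of that concrete, a minimal sketch (untested; it relies on dma-buf file descriptors getting unique inode numbers from the kernel's private dma-buf mount):

    #include <stdint.h>
    #include <sys/stat.h>

    /* Record a unique ID for a DMABuf fd at checkpoint time, so the
     * restore side can match every import back to the BO that backs it,
     * possibly across processes in the same restore session. */
    static int dmabuf_unique_id(int dmabuf_fd, uint64_t *id)
    {
        struct stat st;

        if (fstat(dmabuf_fd, &st))
            return -1;
        *id = (uint64_t)st.st_ino;
        return 0;
    }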
>>>>>>>>>>>
>>>>>>>>>>> This part feels like it is breaking the component separation a bit, because even for buffers fully owned by amdgpu, strictly speaking the dma-buf fd is not. At least my understanding from the above is that you propose to attempt to import the fd, from the kernel side, during the checkpoint process? Like:
>>>>>>>>>>>
>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>
>>>>>>>>>>>   CRIU for each anon fd:
>>>>>>>>>>>     amdgpu_plugin(fd)
>>>>>>>>>>>       -> attempt in-kernel dma-buf import (passes fd to kernel via ioctl?)
>>>>>>>>>>>       -> is it ours? (no -> error)
>>>>>>>>>>>       -> create a record mapping fd number to amdgpu BO
>>>>>>>>>>>
>>>>>>>>>>> Restore:
>>>>>>>>>>>
>>>>>>>>>>>   for each dma-buf fd record:
>>>>>>>>>>>     create BO if it does not exist
>>>>>>>>>>>     export BO to the same fd
>>>>>>>>>>>     close BO handle if not in regular BO handle records
>>>>>>>>>>>
>>>>>>>>>>> Or since you mention lookup by inode, that would need to be dmabuf_plugin so it can look up inodes in the private mount space. However, how it would co-operate with amdgpu_plugin is not clear to me.
>>>>>>>>>
>>>>>>>>> The way I think about the ownership is: whichever driver created the underlying BO owns the checkpointing of the dmabuf. You need driver-specific information to link the dmabuf with the driver's BO, and you need the right driver to recreate the BO and the dmabuf fd when restoring the checkpoint.
>>>>>>>>>
>>>>>>>>> It gets really interesting if you have an amdgpu plugin and an i915 plugin, and they checkpoint an application that shares BOs between the two devices through DMABufs. E.g. if i915 created a BO and amdgpu imported it, then during restore, i915 needs to restore the dmabuf before the amdgpu import of it can be restored. I think that brings us back to a two-phase approach to restoring the memory sharing relationships. Uff.
>>>>>>>>
>>>>>>>> I think this part of the discussion somewhat depends on the previous part about idling. If it is feasible to completely idle and prune, and fail if that is not happening quickly enough, then maybe there wouldn't be too much hierarchical state to save.
>>>>>>>>
>>>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which understands amdgpu fds, i915, syncobj, sync_file, and uses some new uapi to uniquely identify each, associate it with the correct driver, and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so, building the in-memory representation of their relationships. As long as all objects and their relationships have been recorded, I think everything could then be correctly restored.
>>>>>>>>
>>>>>>>> It is possible there is a flaw in my thinking and that something in the CRIU design would make this impossible? I think it would require the top-level drm_plugin.so to hold all state in memory until the whole checkpointing is done, and then verify nothing is incomplete, failing it all if it was. (For instance, one plugin discovered a reference to an object which was not discovered by any other plugin, or things like that.) This may need some further tweaks to CRIU common code.
>>>>>>>>
>>>>>>>> Maybe I need to better understand how exactly you mean to query the DRM driver about random anonymous fds. I see it as a problem in the design, possibly even implementation, but maybe I am missing something which makes it not so. I mean, even with my general idea I don't know how one would determine which driver to query about a particular anonymous inode.
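One way to do the "is it ours?" probe from the pseudo-code above without any new uapi would be to simply try a PRIME import on each driver's render node; a rough sketch (untested):

    #include <stdint.h>
    #include <xf86drm.h>

    /* Returns 1 if anon_fd is a dma-buf that the device behind drm_fd
     * can import, 0 otherwise. Note this only proves the fd is an
     * importable dma-buf, not that this driver created it - consistent
     * with creator tracking being unreliable anyway, as discussed above. */
    static int dmabuf_importable(int drm_fd, int anon_fd)
    {
        struct drm_gem_close args = { 0 };
        uint32_t handle;

        if (drmPrimeFDToHandle(drm_fd, anon_fd, &handle))
            return 0;  /* not a dma-buf, or not importable here */

        args.handle = handle;
        drmIoctl(drm_fd, DRM_IOCTL_GEM_CLOSE, &args);
        return 1;
    }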
>>>>>>>>>> I later also realised that I was maybe increasing the scope for you here. :) You did state the focus is ROCm applications, which possibly don't care about dma-resv, fences, syncobjs etc?
>>>>>>>>>
>>>>>>>>> That's my focus for now. But I don't want to engineer a solution that would preclude your use cases in the future.
>>>>>>>>>
>>>>>>>>>> But I think the "how to handle dma-bufs" design question is still relevant and interesting. For example, I had this thought that perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>>>>
>>>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel objects belonging to different drivers (kfd, amdgpu, dma-buf, ...). And if one day someone were to try to handle dma fences and drm syncobjs, the argument for a hierarchical design would be even stronger, I think.
>>>>>>>>>>
>>>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu, dma-buf, sync file, ...) and internally build the representation of the whole state and the relationships between the objects.
>>>>>>>>>
>>>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not sure it's strictly a hierarchy. Maybe more like some common code shared by multiple GPU driver plugins. I think the common checkpoint state is quite limited, and restoring it requires the GPU-specific drivers anyway.
>>>>>>>>>
>>>>>>>>> Also, the idea of building a representation of the whole state doesn't work well with the CRIU design, because "the whole state" can include multiple processes that restore themselves concurrently and only synchronize with each other in a few places in the restore process. I feel, if we can work out how to checkpoint and restore shared objects between processes, we can do the same for shared objects between drivers, without imposing a strict hierarchy and some omniscient entity that needs to know "the whole state".
>>>>>>>>
>>>>>>>> Okay, this continues on the same problem space as above. And you obviously know how CRIU works much better than me.
>>>>>>>>
>>>>>>>>>> With that kind of design there probably would be a need to define some common kernel-side api and uapi, so all involved objects can be enumerated with some unique ids etc.
>>>>>>>>>>
>>>>>>>>>> Now.. the counter-argument.. the more state from different drivers one would want to handle, the bigger this project would get. Would it even be feasible is the question - to the point that it may be simpler to just run the workload in a VM via SR-IOV and simply hibernate the whole thin guest. :)
>>>>>>>>>
>>>>>>>>> Well, CRIU kind of tries to do that, but with containers instead of VMs. ;)
>>>>>>>>
>>>>>>>> It would definitely be useful for hardware and drivers without SR-IOV support, so let's hope it is doable. :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tvrtko