Re: Proposal to add CRIU support to DRM render nodes
On 2024-04-16 10:04, Tvrtko Ursulin wrote: > > On 01/04/2024 18:58, Felix Kuehling wrote: >> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote: >>> >>> On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: > > On 28/03/2024 20:42, Felix Kuehling wrote: >> >> On 2024-03-28 12:03, Tvrtko Ursulin wrote: >>> >>> Hi Felix, >>> >>> I had one more thought while browsing around the amdgpu CRIU plugin. It >>> appears it relies on the KFD support being compiled in and /dev/kfd >>> present, correct? AFAICT at least, it relies on that to figure out the >>> amdgpu DRM node. >>> >>> In would be probably good to consider designing things without that >>> dependency. So that checkpointing an application which does not use >>> /dev/kfd is possible. Or if the kernel does not even have the KFD >>> support compiled in. >> >> Yeah, if we want to support graphics apps that don't use KFD, we should >> definitely do that. Currently we get a lot of topology information from >> KFD, not even from the /dev/kfd device but from the sysfs nodes exposed >> by KFD. We'd need to get GPU device info from the render nodes instead. >> And if KFD is available, we may need to integrate both sources of >> information. >> >> >>> >>> It could perhaps mean no more than adding some GPU discovery code into >>> CRIU. Which shuold be flexible enough to account for things like >>> re-assigned minor numbers due driver reload. >> >> Do you mean adding GPU discovery to the core CRIU, or to the plugin. I >> was thinking this is still part of the plugin. > > Yes I agree. I was only thinking about adding some DRM device discovery > code in a more decoupled fashion from the current plugin, for both the > reason discussed above (decoupling a bit from reliance on kfd sysfs), and > then also if/when a new DRM driver might want to implement this the code > could be move to some common plugin area. > > I am not sure how feasible that would be though. The "gpu id" concept and > it's matching in the current kernel code and CRIU plugin - is that value > tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. >>> >>> Doesn't it read the saved gpu id from the image file while doing restore >>> and tries to open the render node to match it? Maybe I am misreading the >>> code.. But if it does, does it imply that in practice it could be stable >>> across reboots? Or that it is not possible to restore to a different >>> instance of maybe the same GPU model installed in a system? >> >> Ah, the idea is, that when you restore on a different system, you may get >> different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore >> it on GPU 2 on the same system. That's why we need to translate GPU IDs in >> restored applications. User mode still uses the old GPU IDs, but the kernel >> mode driver translates them to the actual GPU IDs of the GPUs that the >> process was restored on. > > I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but > for restored clients it gets overriden during restore so that any further > ioctls can actually not instantly fail. > > And then in amdgpu_plugin_restore_file, when it is opening the render node, > it relies on the kfd topology to have filled in (more or less) the > target_gpu_id corresponding to the render node gpu id of the target GPU - the > one associated with the new kfd gpu_id? Yes. > > I am digging into this because I am trying to see if some part of GPU > discovery could somehow be decoupled.. to offer you to work on at least that > until you start to tackle the main body of the feature. But it looks properly > tangled up. OK. Most of the interesting plugin code should be in amdgpu_plugin_topology.c. It currently has some pretty complicated logic to find a set of devices that matches the topology in the checkpoint, including shader ISA versions, numbers of compute units, memory sizes, firmware versions and IO-Links between GPUs. This was originally done to support P2P with XGMI links. I'm not sure we ever updated it to properly support PCIe P2P. > > Do you have any suggestions with what I could help with? Maybe developing > some sort of drm device enumeration library if you see a way that would be > useful in decoupling the device discovery from kfd. We would need to define > what sort of information you would need to be queryable from it. Maybe. I think a lot of device information is available with some amd
Re: Proposal to add CRIU support to DRM render nodes
On 01/04/2024 18:58, Felix Kuehling wrote: On 2024-04-01 12:56, Tvrtko Ursulin wrote: On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. Doesn't it read the saved gpu id from the image file while doing restore and tries to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system? Ah, the idea is, that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on. I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but for restored clients it gets overriden during restore so that any further ioctls can actually not instantly fail. And then in amdgpu_plugin_restore_file, when it is opening the render node, it relies on the kfd topology to have filled in (more or less) the target_gpu_id corresponding to the render node gpu id of the target GPU - the one associated with the new kfd gpu_id? I am digging into this because I am trying to see if some part of GPU discovery could somehow be decoupled.. to offer you to work on at least that until you start to tackle the main body of the feature. But it looks properly tangled up. Do you have any suggestions with what I could help with? Maybe developing some sort of drm device enumeration library if you see a way that would be useful in decoupling the device discovery from kfd. We would need to define what sort of information you would need to be queryable from it. This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more? Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf. From the outside (userspace) this looks simply like multiple DRM render nodes or how exactly? Regards, Tvrtko Regards, Felix
Re: Proposal to add CRIU support to DRM render nodes
On 2024-04-01 12:56, Tvrtko Ursulin wrote: On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. Doesn't it read the saved gpu id from the image file while doing restore and tries to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system? Ah, the idea is, that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on. This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more? Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf. Regards, Felix Regards, Tvrtko Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regar
Re: Proposal to add CRIU support to DRM render nodes
On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. Doesn't it read the saved gpu id from the image file while doing restore and tries to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system? This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more? Regards, Tvrtko Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, thei
Re: Proposal to add CRIU support to DRM render nodes
On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. Regards, Felix Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API det
Re: Proposal to add CRIU support to DRM render nodes
On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind
Re: Proposal to add CRIU support to DRM render nodes
On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a po
Re: Proposal to add CRIU support to DRM render nodes
Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to r
Re: Proposal to add CRIU support to DRM render nodes
On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? Or how is that handled? In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", that are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint. If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait for all currently pendi
Re: Proposal to add CRIU support to DRM render nodes
On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? Or how is that handled? In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", that are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint. If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait for all currently pending submissions to drain, I think we don't nee
Re: Proposal to add CRIU support to DRM render nodes
On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? Or how is that handled? For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to par
Re: Proposal to add CRIU support to DRM render nodes
Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? Or how is that handled? For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary lay
Re: Proposal to add CRIU support to DRM render nodes
I haven't seen any replies to this proposal. Either it got lost in the pre-holiday noise, or there is genuinely no interest in this. If it's the latter, I would look for an AMDGPU driver-specific solution with minimally invasive changes in DRM and DMABuf code, if needed. Maybe it could be generalized later if there is interest then. Regards, Felix On 2023-12-06 16:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI. Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data. # Second API for creating objects Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinne
Proposal to add CRIU support to DRM render nodes
Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI. Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data. # Second API for creating objects Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinned BOs as pinned. Validation of virtual addresses and handling MMU notifiers must be suspended until the virtual address space is restored. For user mode queues we need to save and restore a lot of queue execution state so that execution can resume cleanly. # Restoring buffer sharing relationships Different GEM handles in different render nodes and processes can refer to the same underlying shared memory, either b