Re: Proposal to add CRIU support to DRM render nodes

2024-05-03 Thread Felix Kuehling



On 2024-04-16 10:04, Tvrtko Ursulin wrote:
> 
> On 01/04/2024 18:58, Felix Kuehling wrote:
>>
>> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>>
>>> On 01/04/2024 17:37, Felix Kuehling wrote:
 On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>
> On 28/03/2024 20:42, Felix Kuehling wrote:
>>
>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>
>>> Hi Felix,
>>>
>>> I had one more thought while browsing around the amdgpu CRIU plugin. It 
>>> appears it relies on the KFD support being compiled in and /dev/kfd 
>>> present, correct? AFAICT at least, it relies on that to figure out the 
>>> amdgpu DRM node.
>>>
>>> It would probably be good to consider designing things without that 
>>> dependency. So that checkpointing an application which does not use 
>>> /dev/kfd is possible. Or if the kernel does not even have the KFD 
>>> support compiled in.
>>
>> Yeah, if we want to support graphics apps that don't use KFD, we should 
>> definitely do that. Currently we get a lot of topology information from 
>> KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
>> by KFD. We'd need to get GPU device info from the render nodes instead. 
>> And if KFD is available, we may need to integrate both sources of 
>> information.
>>
>>
>>>
>>> It could perhaps mean no more than adding some GPU discovery code into 
>>> CRIU, which should be flexible enough to account for things like 
>>> re-assigned minor numbers due to driver reload.
>>
>> Do you mean adding GPU discovery to the core CRIU, or to the plugin? I 
>> was thinking this is still part of the plugin.
>
> Yes I agree. I was only thinking about adding some DRM device discovery 
> code in a more decoupled fashion from the current plugin, for both the 
> reason discussed above (decoupling a bit from reliance on kfd sysfs), and 
> then also if/when a new DRM driver might want to implement this the code 
> could be moved to some common plugin area.
>
> I am not sure how feasible that would be though. The "gpu id" concept and 
> its matching in the current kernel code and CRIU plugin - is that value 
> tied to the physical GPU instance, or how does it work?

 The concept of the GPU ID is that it's stable while the system is up, even 
 when devices get added and removed dynamically. It was baked into the API 
 early on, but I don't think we ever fully validated device hot plug. I 
 think the closest we're getting is with our latest MI GPUs and dynamic 
 partition mode change.
>>>
>>> Doesn't it read the saved gpu id from the image file while doing restore 
> and try to open the render node to match it? Maybe I am misreading the 
> code, but if it does, does it imply that in practice it could be stable 
>>> across reboots? Or that it is not possible to restore to a different 
>>> instance of maybe the same GPU model installed in a system?
>>
>> Ah, the idea is that when you restore on a different system, you may get 
>> different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore 
>> it on GPU 2 on the same system. That's why we need to translate GPU IDs in 
>> restored applications. User mode still uses the old GPU IDs, but the kernel 
>> mode driver translates them to the actual GPU IDs of the GPUs that the 
>> process was restored on.
> 
> I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but 
> for restored clients it gets overridden during restore so that any further 
> ioctls do not instantly fail.
> 
> And then in amdgpu_plugin_restore_file, when it is opening the render node, 
> it relies on the kfd topology to have filled in (more or less) the 
> target_gpu_id corresponding to the render node gpu id of the target GPU - the 
> one associated with the new kfd gpu_id?

Yes.

> 
> I am digging into this because I am trying to see if some part of GPU 
> discovery could somehow be decoupled, so I could offer to work on at least that 
> until you start to tackle the main body of the feature. But it looks properly 
> tangled up.

OK. Most of the interesting plugin code should be in amdgpu_plugin_topology.c. 
It currently has some pretty complicated logic to find a set of devices that 
matches the topology in the checkpoint, including shader ISA versions, numbers 
of compute units, memory sizes, firmware versions and IO-Links between GPUs. 
This was originally done to support P2P with XGMI links. I'm not sure we ever 
updated it to properly support PCIe P2P.
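
For illustration only, the matching boils down to a compatibility predicate 
along these lines (the struct and function names here are made up for the 
example, not the actual plugin code):

/* Hypothetical sketch of the device matching done by the plugin. */
#include <stdbool.h>
#include <stdint.h>

struct gpu_props {
    uint32_t gfx_target_version;  /* shader ISA version */
    uint32_t simd_count;          /* compute units / SIMDs */
    uint64_t local_mem_size;      /* VRAM size in bytes */
    uint32_t fw_version;          /* firmware version */
};

/* A checkpointed GPU can only be restored onto a candidate GPU whose
 * properties are compatible with the saved state. */
static bool gpu_is_compatible(const struct gpu_props *ckpt,
                              const struct gpu_props *cand)
{
    return ckpt->gfx_target_version == cand->gfx_target_version &&
           ckpt->simd_count <= cand->simd_count &&
           ckpt->local_mem_size <= cand->local_mem_size &&
           ckpt->fw_version == cand->fw_version;
}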


> 
> Do you have any suggestions for what I could help with? Maybe developing 
> some sort of drm device enumeration library if you see a way that would be 
> useful in decoupling the device discovery from kfd. We would need to define 
> what sort of information would need to be queryable from it.

Maybe. I think a lot of device information is available with some 

Re: Proposal to add CRIU support to DRM render nodes

2024-04-16 Thread Tvrtko Ursulin



On 01/04/2024 18:58, Felix Kuehling wrote:


On 2024-04-01 12:56, Tvrtko Ursulin wrote:


On 01/04/2024 17:37, Felix Kuehling wrote:

On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU 
plugin. It appears it relies on the KFD support being compiled in 
and /dev/kfd present, correct? AFAICT at least, it relies on that 
to figure out the amdgpu DRM node.


It would probably be good to consider designing things without 
that dependency. So that checkpointing an application which does 
not use /dev/kfd is possible. Or if the kernel does not even have 
the KFD support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from 
the sysfs nodes exposed by KFD. We'd need to get GPU device info 
from the render nodes instead. And if KFD is available, we may need 
to integrate both sources of information.
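
As a minimal sketch of what KFD-independent discovery could look like, using 
only libdrm (this assumes drmGetDevices2()/drmGetVersion(), omits error 
handling, and is purely illustrative, not the plugin code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>

static void enumerate_render_nodes(void)
{
    drmDevicePtr devices[64];
    int i, n = drmGetDevices2(0, devices, 64);

    if (n < 0)
        return;

    for (i = 0; i < n; i++) {
        if (!(devices[i]->available_nodes & (1 << DRM_NODE_RENDER)))
            continue;

        const char *path = devices[i]->nodes[DRM_NODE_RENDER];
        int fd = open(path, O_RDWR);
        if (fd < 0)
            continue;

        drmVersionPtr ver = drmGetVersion(fd);
        if (ver) {
            /* Driver name plus PCI bus info is enough to re-identify a
             * device across driver reloads, independent of the render
             * minor number. */
            printf("%s: driver %s", path, ver->name);
            if (devices[i]->bustype == DRM_BUS_PCI)
                printf(" at %04x:%02x:%02x.%d",
                       devices[i]->businfo.pci->domain,
                       devices[i]->businfo.pci->bus,
                       devices[i]->businfo.pci->dev,
                       devices[i]->businfo.pci->func);
            printf("\n");
            drmFreeVersion(ver);
        }
        close(fd);
    }
    drmFreeDevices(devices, n);
}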





It could perhaps mean no more than adding some GPU discovery code 
into CRIU, which should be flexible enough to account for things 
like re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the 
plugin? I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance 
on kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be moved to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" 
concept and its matching in the current kernel code and CRIU plugin 
- is that value tied to the physical GPU instance, or how does it work?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked 
into the API early on, but I don't think we ever fully validated 
device hot plug. I think the closest we're getting is with our latest 
MI GPUs and dynamic partition mode change.


Doesn't it read the saved gpu id from the image file while doing 
restore and try to open the render node to match it? Maybe I am 
misreading the code, but if it does, does it imply that in practice 
it could be stable across reboots? Or that it is not possible to 
restore to a different instance of maybe the same GPU model installed 
in a system?


Ah, the idea is that when you restore on a different system, you may 
get different GPU IDs. Or you may checkpoint an app running on GPU 1 but 
restore it on GPU 2 on the same system. That's why we need to translate 
GPU IDs in restored applications. User mode still uses the old GPU IDs, 
but the kernel mode driver translates them to the actual GPU IDs of the 
GPUs that the process was restored on.
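
As a toy illustration of that translation concept (the names and data 
structure here are made up, this is not the actual KFD implementation):

#include <stddef.h>
#include <stdint.h>

struct gpu_id_map {
    uint32_t user_gpu_id;   /* ID baked into the checkpointed process */
    uint32_t actual_gpu_id; /* ID of the GPU the process was restored on */
};

/* Translate a GPU ID coming in from user mode to the GPU the process
 * actually runs on after restore. */
static uint32_t translate_gpu_id(const struct gpu_id_map *map, size_t n,
                                 uint32_t user_gpu_id)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (map[i].user_gpu_id == user_gpu_id)
            return map[i].actual_gpu_id;
    return user_gpu_id; /* not a restored process: IDs map 1:1 */
}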


I see.. I think. Normal flow is ppd->user_gpu_id set during client init, 
but for restored clients it gets overridden during restore so that any 
further ioctls do not instantly fail.


And then in amdgpu_plugin_restore_file, when it is opening the render 
node, it relies on the kfd topology to have filled in (more or less) the 
target_gpu_id corresponding to the render node gpu id of the target GPU 
- the one associated with the new kfd gpu_id?


I am digging into this because I am trying to see if some part of GPU 
discovery could somehow be decoupled, so I could offer to work on at least 
that until you start to tackle the main body of the feature. But it 
looks properly tangled up.


Do you have any suggestions for what I could help with? Maybe 
developing some sort of drm device enumeration library if you see a way 
that would be useful in decoupling the device discovery from kfd. We 
would need to define what sort of information would need to be 
queryable from it.
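
One possible shape for such a library, entirely hypothetical and only meant 
to make the information needs concrete:

#include <stdint.h>

struct drm_ckpt_device {
    char     render_path[64];   /* e.g. /dev/dri/renderD128 */
    char     driver[32];        /* "amdgpu", "i915", ... */
    uint16_t pci_domain, pci_bus, pci_dev, pci_func;
    uint32_t device_id;         /* PCI device ID, for model matching */
    uint64_t vram_size;         /* for capacity checks on restore */
};

/* Enumerate render nodes present on the system, independent of /dev/kfd. */
int drm_ckpt_enumerate(struct drm_ckpt_device *out, int max_out);

/* Find the device on the restore system that best matches a checkpointed
 * one (same driver and model, possibly a different minor number). */
int drm_ckpt_match(const struct drm_ckpt_device *saved,
                   const struct drm_ckpt_device *present, int n_present);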


This also highlights another aspect of those spatially partitioned 
GPUs. GPU IDs identify device partitions, not devices. Similarly, 
each partition has its own render node, and the KFD topology info in 
sysfs points to the render-minor number corresponding to each GPU ID.


I am not familiar with this. This is not SR-IOV but some other kind of 
partitioning? Would you have any links where I could read more?


Right, the bare-metal driver can partition a PF spatially without SRIOV. 
SRIOV can also use spatial partitioning and expose each partition 
through its own VF, but that's not useful for bare metal. Spatial 
partitioning is new in MI300. There is some high-level info in this 
whitepaper: 
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.


From the outside (userspace) this looks simply like multiple DRM render 
nodes or how exactly?


Regards,

Tvrtko



Regards,
   Felix



Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling



On 2024-04-01 12:56, Tvrtko Ursulin wrote:


On 01/04/2024 17:37, Felix Kuehling wrote:

On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU 
plugin. It appears it relies on the KFD support being compiled in 
and /dev/kfd present, correct? AFAICT at least, it relies on that 
to figure out the amdgpu DRM node.


It would probably be good to consider designing things without 
that dependency. So that checkpointing an application which does 
not use /dev/kfd is possible. Or if the kernel does not even have 
the KFD support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from 
the sysfs nodes exposed by KFD. We'd need to get GPU device info 
from the render nodes instead. And if KFD is available, we may need 
to integrate both sources of information.





It could perhaps mean no more than adding some GPU discovery code 
into CRIU, which should be flexible enough to account for things 
like re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the 
plugin? I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance 
on kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be moved to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" 
concept and its matching in the current kernel code and CRIU plugin 
- is that value tied to the physical GPU instance, or how does it work?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked 
into the API early on, but I don't think we ever fully validated 
device hot plug. I think the closest we're getting is with our latest 
MI GPUs and dynamic partition mode change.


Doesn't it read the saved gpu id from the image file while doing 
restore and try to open the render node to match it? Maybe I am 
misreading the code, but if it does, does it imply that in practice 
it could be stable across reboots? Or that it is not possible to 
restore to a different instance of maybe the same GPU model installed 
in a system?


Ah, the idea is that when you restore on a different system, you may 
get different GPU IDs. Or you may checkpoint an app running on GPU 1 but 
restore it on GPU 2 on the same system. That's why we need to translate 
GPU IDs in restored applications. User mode still uses the old GPU IDs, 
but the kernel mode driver translates them to the actual GPU IDs of the 
GPUs that the process was restored on.





This also highlights another aspect of those spatially partitioned 
GPUs. GPU IDs identify device partitions, not devices. Similarly, 
each partition has its own render node, and the KFD topology info in 
sysfs points to the render-minor number corresponding to each GPU ID.


I am not familiar with this. This is not SR-IOV but some other kind of 
partitioning? Would you have any links where I could read more?


Right, the bare-metal driver can partition a PF spatially without SRIOV. 
SRIOV can also use spatial partitioning and expose each partition 
through its own VF, but that's not useful for bare metal. Spatial 
partitioning is new in MI300. There is some high-level info in this 
whitepaper: 
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.


Regards,
  Felix




Regards,

Tvrtko

Otherwise I am eagerly waiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend 
to other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to 
work on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications 
once they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why 
we are doing this, and outlining some of the problems we need 
to solve to checkpoint and restore render node state and 
shared memory (DMABuf) state. I have some thoughts on the API 
design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community 

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Tvrtko Ursulin



On 01/04/2024 17:37, Felix Kuehling wrote:

On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and 
/dev/kfd present, correct? AFAICT at least, it relies on that to 
figure out the amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from the 
sysfs nodes exposed by KFD. We'd need to get GPU device info from the 
render nodes instead. And if KFD is available, we may need to 
integrate both sources of information.





It could perhaps mean no more than adding some GPU discovery code 
into CRIU, which should be flexible enough to account for things 
like re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin? 
I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance on 
kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be moved to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance, or how does it work?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked into 
the API early on, but I don't think we ever fully validated device hot 
plug. I think the closest we're getting is with our latest MI GPUs and 
dynamic partition mode change.


Doesn't it read the saved gpu id from the image file while doing restore 
and try to open the render node to match it? Maybe I am misreading the 
code, but if it does, does it imply that in practice it could be stable 
across reboots? Or that it is not possible to restore to a different 
instance of maybe the same GPU model installed in a system?


This also highlights another aspect of those spatially partitioned GPUs. 
GPU IDs identify device partitions, not devices. Similarly, each 
partition has its own render node, and the KFD topology info in sysfs 
points to the render-minor number corresponding to each GPU ID.


I am not familiar with this. This is not SR-IOV but some other kind of 
partitioning? Would you have any links where I could read more?


Regards,

Tvrtko

Otherwise I am eagerly waiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to 
work on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications 
once they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual 
address management, that creates a problem for checkpointing 
and restoring ROCm applications with CRIU. Currently there is 
no support for checkpointing and restoring render node state, 
other than CPU virtual address mappings. Support will be needed 
for checkpointing GEM buffer objects and handles, 

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling

On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and 
/dev/kfd present, correct? AFAICT at least, it relies on that to 
figure out the amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from the 
sysfs nodes exposed by KFD. We'd need to get GPU device info from the 
render nodes instead. And if KFD is available, we may need to 
integrate both sources of information.





It could perhaps mean no more than adding some GPU discovery code 
into CRIU, which should be flexible enough to account for things 
like re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin? 
I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance on 
kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be moved to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance, or how does it work?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked into 
the API early on, but I don't think we ever fully validated device hot 
plug. I think the closest we're getting is with our latest MI GPUs and 
dynamic partition mode change.


This also highlights another aspect of those spatially partitioned GPUs. 
GPU IDs identify device partitions, not devices. Similarly, each 
partition has its own render node, and the KFD topology info in sysfs 
points to the render-minor number corresponding to each GPU ID.
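
For reference, a minimal sketch of walking that sysfs layout to map a gpu_id 
to its render minor, roughly what the plugin does today (error handling 
trimmed, illustrative only):

#include <stdio.h>
#include <string.h>

#define KFD_NODES "/sys/class/kfd/kfd/topology/nodes"

static int gpu_id_to_render_minor(unsigned int node, unsigned int *gpu_id,
                                  int *render_minor)
{
    char path[256], key[64];
    long long val;
    FILE *f;

    /* Each topology node exposes its KFD gpu_id... */
    snprintf(path, sizeof(path), KFD_NODES "/%u/gpu_id", node);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%u", gpu_id) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* ...and a properties file with a drm_render_minor entry. */
    snprintf(path, sizeof(path), KFD_NODES "/%u/properties", node);
    f = fopen(path, "r");
    if (!f)
        return -1;
    *render_minor = -1;
    while (fscanf(f, "%63s %lld", key, &val) == 2)
        if (!strcmp(key, "drm_render_minor"))
            *render_minor = (int)val; /* e.g. 128 -> /dev/dri/renderD128 */
    fclose(f);
    return *render_minor >= 0 ? 0 : -1;
}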


Regards,
  Felix




Otherwise I am eagerly waiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to 
work on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications 
once they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual 
address management, that creates a problem for checkpointing 
and restoring ROCm applications with CRIU. Currently there is 
no support for checkpointing and restoring render node state, 
other than CPU virtual address mappings. Support will be needed 
for checkpointing GEM buffer objects and handles, their GPU 
virtual address mappings and memory sharing relationships 
between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers in the future.


One basic question before going into any API 

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Tvrtko Ursulin



On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and /dev/kfd 
present, correct? AFAICT at least, it relies on that to figure out the 
amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we should 
definitely do that. Currently we get a lot of topology information from 
KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
by KFD. We'd need to get GPU device info from the render nodes instead. 
And if KFD is available, we may need to integrate both sources of 
information.





It could perhaps mean no more than adding some GPU discovery code into 
CRIU, which should be flexible enough to account for things like 
re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin? I 
was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device discovery 
code in a more decoupled fashion from the current plugin, for both the 
reason discussed above (decoupling a bit from reliance on kfd sysfs), 
and then also if/when a new DRM driver might want to implement this the 
code could be moved to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance, or how does it work?


Otherwise I am eagerly waiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to work 
on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications once 
they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and 
restoring ROCm applications with CRIU. Currently there is no 
support for checkpointing and restoring render node state, other 
than CPU virtual address mappings. Support will be needed for 
checkpointing GEM buffer objects and handles, their GPU virtual 
address mappings and memory sharing relationships between devices 
and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, 
although I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of 
any concrete plans, but I think the feature is pretty cool and if 
amdgpu gets it working I wouldn't be surprised if other drivers 
would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I 
recently implemented on Mesa's request, which is to be able to 
"upload" the GPU context from the GPU hang error state and replay 
the hanging request. It is kind 

Re: Proposal to add CRIU support to DRM render nodes

2024-03-28 Thread Felix Kuehling



On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and /dev/kfd 
present, correct? AFAICT at least, it relies on that to figure out the 
amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we should 
definitely do that. Currently we get a lot of topology information from 
KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
by KFD. We'd need to get GPU device info from the render nodes instead. 
And if KFD is available, we may need to integrate both sources of 
information.





It could perhaps mean no more than adding some GPU discovery code into 
CRIU, which should be flexible enough to account for things like 
re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin? I 
was thinking this is still part of the plugin.





Otherwise I am eagerly waiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to work 
on this more, but it's probably going to be at least a few weeks.


Regards,
  Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm applications once 
they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and 
restoring ROCm applications with CRIU. Currently there is no 
support for checkpointing and restoring render node state, other 
than CPU virtual address mappings. Support will be needed for 
checkpointing GEM buffer objects and handles, their GPU virtual 
address mappings and memory sharing relationships between devices 
and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, 
although I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of 
any concrete plans, but I think the feature is pretty cool and if 
amdgpu gets it working I wouldn't be surprised if other drivers 
would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I 
recently implemented on Mesa's request, which is to be able to 
"upload" the GPU context from the GPU hang error state and replay 
the hanging request. It is kind of (at a stretch) a very special 
tiny subset of checkpoint and restore, so I am mentioning it just as a 
curiosity.


And there is also another partial conceptual intersection with the 
(at the moment not yet upstream) i915 online debugger. This part 
is in the area of discovering and enumerating GPU resources 
belonging to the client.


I don't see any immediate design or code sharing opportunities 
though, just mentioning it.


I did spend some time reading your plugin and kernel 
implementation out of curiosity and have some comments and 
questions.


With that out of the way, some considerations for a 

Re: Proposal to add CRIU support to DRM render nodes

2024-03-28 Thread Tvrtko Ursulin



Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. It 
appears it relies on the KFD support being compiled in and /dev/kfd 
present, correct? AFAICT at least, it relies on that to figure out the 
amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


It could perhaps mean no more than adding some GPU discovery code into 
CRIU, which should be flexible enough to account for things like 
re-assigned minor numbers due to driver reload.


Otherwise I am eagerly waiting to hear more about the design specifics 
around dma-buf handling. And also seeing how to extend to other DRM 
related anonymous fds.


Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes 
in order to maintain CRIU support for ROCm applications once they 
start relying on render nodes for more GPU memory management. In 
this email I'm providing some background why we are doing this, and 
outlining some of the problems we need to solve to checkpoint and 
restore render node state and shared memory (DMABuf) state. I have 
some thoughts on the API design, leaning on what we did for KFD, 
but would like to get feedback from the DRI community regarding 
that API and to what extent there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM API 
and improve interoperability between graphics and compute. This 
uses DMABufs for sharing buffer objects between KFD and multiple 
render node devices, as well as between processes. In the long run 
this also provides a path to moving all or most memory management 
from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU 
virtual address mappings. Support will be needed for checkpointing 
GEM buffer objects and handles, their GPU virtual address mappings 
and memory sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including scheduler 
contexts and BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf APIs 
and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although 
I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think the feature is pretty cool and if amdgpu gets 
it working I wouldn't be surprised if other drivers would get 
interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the 
GPU context from the GPU hang error state and replay the hanging 
request. It is kind of (at a stretch) a very special tiny subset of 
checkpoint and restore, so I am mentioning it just as a curiosity.


And there is also another partial conceptual intersection with the (at 
the moment not yet upstream) i915 online debugger. This part is 
in the area of discovering and enumerating GPU resources belonging to 
the client.


I don't see any immediate design or code sharing opportunities though, 
just mentioning it.


I did spend some time reading your plugin and kernel implementation 
out of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM 
CRIU API (either generic or AMDGPU driver specific): The API goes 
through several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)


Btw, does checkpointing guarantee that all relevant activity is idled? 
For instance dma_resv objects are free of fences which would need to 

Re: Proposal to add CRIU support to DRM render nodes

2024-03-15 Thread Tvrtko Ursulin



On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes 
in order to maintain CRIU support for ROCm applications once they 
start relying on render nodes for more GPU memory management. In 
this email I'm providing some background why we are doing this, and 
outlining some of the problems we need to solve to checkpoint and 
restore render node state and shared memory (DMABuf) state. I have 
some thoughts on the API design, leaning on what we did for KFD, but 
would like to get feedback from the DRI community regarding that API 
and to what extent there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM API 
and improve interoperability between graphics and compute. This uses 
DMABufs for sharing buffer objects between KFD and multiple render 
node devices, as well as between processes. In the long run this 
also provides a path to moving all or most memory management from 
the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU 
virtual address mappings. Support will be needed for checkpointing 
GEM buffer objects and handles, their GPU virtual address mappings 
and memory sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including scheduler 
contexts and BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf APIs 
and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although 
I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think the feature is pretty cool and if amdgpu gets 
it working I wouldn't be surprised if other drivers would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the 
GPU context from the GPU hang error state and replay the hanging 
request. It is kind of (at a stretch) a very special tiny subset of 
checkpoint and restore, so I am mentioning it just as a curiosity.


And there is also another partial conceptual intersection with the (at 
the moment not yet upstream) i915 online debugger. This part is in 
the area of discovering and enumerating GPU resources belonging to the 
client.


I don't see any immediate design or code sharing opportunities though, 
just mentioning it.


I did spend some time reading your plugin and kernel implementation 
out of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM 
CRIU API (either generic or AMDGPU driver specific): The API goes 
through several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)


Btw, does checkpointing guarantee that all relevant activity is idled? 
For instance, are dma_resv objects free of fences which would need to 
be restored for things to continue executing sensibly? Or how is that 
handled?


In our compute use cases, we suspend user mode queues. This can include 
CWSR (compute-wave-save-restore) where the state of in-flight waves is 
stored in memory and can be reloaded and resumed from memory later. We 
don't use any fences other than "eviction fences", which are signaled 
after the queues are suspended. And those fences are never handed to 
user mode. So we don't need to worry about any fence state in the 
checkpoint.


If we extended this to support the kernel mode command submission APIs, 
I would expect that we'd wait for all current submissions to complete, 
and stop new ones from being sent to the HW before taking the 
checkpoint. When we take the checkpoint in the CRIU plugin, the CPU 
threads are already frozen and cannot submit any more work. If we wait 
for all currently 

Re: Proposal to add CRIU support to DRM render nodes

2024-03-14 Thread Felix Kuehling



On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes 
in order to maintain CRIU support for ROCm applications once they 
start relying on render nodes for more GPU memory management. In 
this email I'm providing some background why we are doing this, and 
outlining some of the problems we need to solve to checkpoint and 
restore render node state and shared memory (DMABuf) state. I have 
some thoughts on the API design, leaning on what we did for KFD, but 
would like to get feedback from the DRI community regarding that API 
and to what extent there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM API 
and improve interoperability between graphics and compute. This uses 
DMABufs for sharing buffer objects between KFD and multiple render 
node devices, as well as between processes. In the long run this 
also provides a path to moving all or most memory management from 
the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU 
virtual address mappings. Support will be needed for checkpointing 
GEM buffer objects and handles, their GPU virtual address mappings 
and memory sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including scheduler 
contexts and BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf APIs 
and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although 
I cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think the feature is pretty cool and if amdgpu gets 
it working I wouldn't be surprised if other drivers would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the 
GPU context from the GPU hang error state and replay the hanging 
request. It is kind of (at a stretch) a very special tiny subset of 
checkpoint and restore, so I am mentioning it just as a curiosity.


And there is also another partial conceptual intersection with the (at 
the moment not yet upstream) i915 online debugger. This part is in 
the area of discovering and enumerating GPU resources belonging to the 
client.


I don't see any immediate design or code sharing opportunities though, 
just mentioning it.


I did spend some time reading your plugin and kernel implementation 
out of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM 
CRIU API (either generic or AMDGPU driver specific): The API goes 
through several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)


Btw, does checkpointing guarantee that all relevant activity is idled? 
For instance, are dma_resv objects free of fences which would need to 
be restored for things to continue executing sensibly? Or how is that 
handled?


In our compute use cases, we suspend user mode queues. This can include 
CWSR (compute-wave-save-restore) where the state of in-flight waves is 
stored in memory and can be reloaded and resumed from memory later. We 
don't use any fences other than "eviction fences", which are signaled 
after the queues are suspended. And those fences are never handed to 
user mode. So we don't need to worry about any fence state in the 
checkpoint.


If we extended this to support the kernel mode command submission APIs, 
I would expect that we'd wait for all current submissions to complete, 
and stop new ones from being sent to the HW before taking the 
checkpoint. When we take the checkpoint in the CRIU plugin, the CPU 
threads are already frozen and cannot submit any more work. If we wait 
for all currently pending submissions to drain, I think we don't 

Re: Proposal to add CRIU support to DRM render nodes

2024-03-12 Thread Tvrtko Ursulin



On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining 
some of the problems we need to solve to checkpoint and restore render 
node state and shared memory (DMABuf) state. I have some thoughts on 
the API design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address mappings 
in ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, 
as well as between processes. In the long run this also provides a 
path to moving all or most memory management from the KFD ioctl API to 
libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU virtual 
address mappings. Support will be needed for checkpointing GEM buffer 
objects and handles, their GPU virtual address mappings and memory 
sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may 
have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although I 
cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think the feature is pretty cool and if amdgpu gets it 
working I wouldn't be surprised if other drivers would get interested.


Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the GPU 
context from the GPU hang error state and replay the hanging request. It 
is kind of (at a stretch) a very special tiny subset of checkpoint and 
restore, so I am mentioning it just as a curiosity.


And there is also another partial conceptual intersection with the (at the 
moment not yet upstream) i915 online debugger. This part is in the 
area of discovering and enumerating GPU resources belonging to the client.


I don't see any immediate design or code sharing opportunities though, but 
just mentioning it.


I did spend some time reading your plugin and kernel implementation out 
of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic or AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)


Btw, does checkpointing guarantee that all relevant activity is idled? For 
instance, are dma_resv objects free of fences which would need to 
be restored for things to continue executing sensibly? Or how is that handled?


For some more background about our implementation in KFD, you can 
refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes, I'll 
address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
    interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
    the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
    relationships rather than replaying BO creation, export and import
    from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to 

Re: Proposal to add CRIU support to DRM render nodes

2024-03-11 Thread Tvrtko Ursulin



Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining some 
of the problems we need to solve to checkpoint and restore render node 
state and shared memory (DMABuf) state. I have some thoughts on the API 
design, leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent there is 
interest in making that generic.


We are working on using DRM render nodes for virtual address mappings in 
ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, as 
well as between processes. In the long run this also provides a path to 
moving all or most memory management from the KFD ioctl API to libdrm.
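
For context, a rough sketch of the render-node VA management flow this refers 
to, using libdrm's amdgpu helpers (assumes amdgpu.h/amdgpu_drm.h, omits error 
handling; illustrative only, not the actual ROCm runtime code):

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

static void alloc_and_map_bo(int render_fd)
{
    amdgpu_device_handle dev;
    uint32_t major, minor;

    amdgpu_device_initialize(render_fd, &major, &minor, &dev);

    /* Allocate a buffer object through the render node. */
    struct amdgpu_bo_alloc_request req = {
        .alloc_size = 1 << 20,
        .phys_alignment = 1 << 16,
        .preferred_heap = AMDGPU_GEM_DOMAIN_VRAM,
    };
    amdgpu_bo_handle bo;
    amdgpu_bo_alloc(dev, &req, &bo);

    /* Reserve a GPU virtual address range and map the BO into it. This
     * per-BO VA state is exactly what a render-node checkpoint would have
     * to capture and re-establish. */
    uint64_t va;
    amdgpu_va_handle va_handle;
    amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general, 1 << 20,
                          1 << 16, 0, &va, &va_handle, 0);
    amdgpu_bo_va_op(bo, 0, 1 << 20, va, 0, AMDGPU_VA_OP_MAP);
}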


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring ROCm 
applications with CRIU. Currently there is no support for checkpointing 
and restoring render node state, other than CPU virtual address 
mappings. Support will be needed for checkpointing GEM buffer objects 
and handles, their GPU virtual address mappings and memory sharing 
relationships between devices and processes.
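
As a rough illustration, a per-BO checkpoint record would probably need to 
capture at least the following kind of information (field names invented 
for the example; this is not an actual or proposed uAPI):

/* Illustrative only -- invented names, not a real or proposed uAPI. */
#include <linux/types.h>

struct drm_criu_bo_record {
	__u32 gem_handle;	/* handle to recreate with the same value */
	__u32 flags;		/* domain/placement and caching flags */
	__u64 size;		/* buffer size in bytes */
	__u64 gpu_va;		/* example single GPU VA mapping; real BOs may have several */
	__u64 offset;		/* mmap offset for restoring CPU mappings */
	__s32 dmabuf_fd;	/* -1, or an fd describing a BO shared between processes */
	__u32 pad;
};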


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may have 
implications for other drivers in the future.


One basic question before going into any API details: Is there a desire 
to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although I cannot 
answer the last question here.


Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented at Mesa's request, which is to be able to "upload" the GPU 
context from the GPU hang error state and replay the hanging request. It 
is kind of (at a stretch) a very special tiny subset of checkpoint and 
restore, so I am mentioning it only as a curiosity.


And there is also another partial conceptual intersection with the (at the 
moment not yet upstream) i915 online debugger, in the area of discovering 
and enumerating the GPU resources belonging to a client.


I don't see immediate design or code sharing opportunities, though, so I 
am just mentioning it.


I did spend some time reading your plugin and kernel implementation out 
of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic or AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)


Btw, does check-pointing guarantee that all relevant activity is idled? For 
instance, that dma_resv objects are free of fences which would need to be 
restored for things to continue executing sensibly? Or how is that handled?


For some more background about our implementation in KFD, you can refer 
to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes; I'll 
address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that information. 
Therefore, much of the information in our KFD CRIU ioctl API is opaque. 
It is written by kernel mode in the checkpoint, it is consumed by kernel 
mode when restoring the checkpoint, but user mode doesn't care about the 
contents or binary 

Re: Proposal to add CRIU support to DRM render nodes

2024-01-15 Thread Felix Kuehling
I haven't seen any replies to this proposal. Either it got lost in the 
pre-holiday noise, or there is genuinely no interest in this.


If it's the latter, I would look for an AMDGPU driver-specific solution 
with minimally invasive changes in DRM and DMABuf code, if needed. Maybe 
it could be generalized later if there is interest then.


Regards,
  Felix


On 2023-12-06 16:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background on why we are doing this, and outlining 
some of the problems we need to solve to checkpoint and restore render 
node state and shared memory (DMABuf) state. I have some thoughts on 
the API design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address mappings 
in ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, 
as well as between processes. In the long run this also provides a 
path to moving all or most memory management from the KFD ioctl API to 
libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU virtual 
address mappings. Support will be needed for checkpointing GEM buffer 
objects and handles, their GPU virtual address mappings and memory 
sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may 
have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic or AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can
allocate memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this
time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)

For some more background about our implementation in KFD, you can 
refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes; I'll 
address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that 
information. Therefore, much of the information in our KFD CRIU ioctl 
API is opaque. It is written by kernel mode in the checkpoint, it is 
consumed by kernel mode when restoring the checkpoint, but user mode 
doesn't care about the contents or binary layout, so there is no user 
mode ABI to break. This is how we were able to maintain CRIU support 
when we added the SVM API to KFD without changing the CRIU plugin and 
without breaking our ABI.


Opaque information may also lend itself to API abstraction, if this 
becomes a generic DRM API with driver-specific callbacks that fill in 
HW-specific opaque data.
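
To illustrate the idea, a minimal sketch of what such an abstraction could 
look like (all names here are made up for the example; this is not existing 
code or a finished interface):

/* Minimal sketch of a driver-abstracted opaque checkpoint blob.
 * All names are invented for illustration. */
#include <linux/types.h>

struct drm_device;
struct drm_file;

/* Per-driver callbacks that produce and consume HW-specific opaque data.
 * DRM core and the CRIU plugin treat the blob as a black box. */
struct drm_criu_funcs {
	int (*checkpoint)(struct drm_device *dev, struct drm_file *file,
			  void *blob, size_t size);
	int (*restore)(struct drm_device *dev, struct drm_file *file,
		       const void *blob, size_t size);
	/* Tells user mode how much memory to allocate for the blob. */
	size_t (*priv_data_size)(struct drm_device *dev, struct drm_file *file);
};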


# Second API for creating objects

Creating BOs and other objects when restoring a checkpoint needs more 
information than the usual BO alloc and similar APIs provide. For 
example, we need to restore BOs with the same GEM handles so that user 
mode can continue using those handles after resuming execution. If BOs 
are shared through DMABufs without dynamic attachment, we need to 
restore