I am sending this early, mainly to get a feel for how it will be received.

The cover letter is written mostly for people who already know what
checkpoint and restore is, so I will just restate the high level goal: to be
able to checkpoint and restore a pure render node process (so no kfd in the
picture). For people not too familiar with it, it is probably best to first
read about it here: https://criu.org/Main_Page.

There are three pieces of work here:

1. This kernel series, which adds new uapi to amdgpu.

2. A new IGT test case, which helped me find what doesn't work and verify
   what I added.

   https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/log/?h=amd-criu

   The IGT starts out by adding some very basic tests (first commit), laid
   out in order of increasing complexity. The following commits then add
   more tests and by the end, with all these series combined, they all pass
   apart from the forking subtest, which I am leaving out of scope for now.
   (And to be clear, before this work none of the test cases can pass.)

3. Changes to the amdgpu CRIU plugin which use the above uapi, among other
   changes.

   https://github.com/tursulin/criu/pull/new/amdgpu-render-node-rfc

As a picture is worth a thousand words, the best I can do is some terminal
pastes showing it all in action.

Terminal 1:

 $ sudo ~/build-holo/tests/amdgpu/amd_criu --r busy-client-content
 IGT-Version: 2.3-ge37a85b91 (x86_64) (Linux: 7.1.0-rc2-cfs x86_64)
 Using IGT_SRANDOM=1779805687 for randomisation
 Opened device: /dev/dri/renderD128
 Starting subtest: busy-client-content
 Start checkpointing within 10 seconds...

Now switch to terminal 2:

 $ sudo /usr/local/sbin/criu dump -t `pgrep amd_criu | head -1` \
      -L /usr/local/lib/criu/ -vvv -o criu.log -j --link-remap --tcp-established \
      --file-locks --ext-unix-sk

Back to terminal 1:

 ...
 Killed		# This is normal - CRIU dump has saved and terminated the process

Back to terminal 2, let's restore it:

 $ sudo /usr/local/sbin/criu restore -L /usr/local/lib/criu/ -vvv \
      -o restore.log --shell-job --link-remap --tcp-established --file-locks --ext-unix-sk
 Subtest busy-client-content: SUCCESS (10.739s)

And that is it. A client which was busy looping submitting an SDMA_NOP IB was
successfully checkpointed and restored to completion. It ran for the
remainder of the intended duration, and we checked that the buffer content
was as expected at the end.

There is definitely more to do: play more with exported buffers, syncobjs,
fences and buffer object lists. But for now, as said, I am looking for some
early feedback.

Tvrtko Ursulin (5):
  drm/amdgpu: Extend listing of buffer handles with the userptr object flag
  drm/amdgpu: Add a reserved VM ID query
  drm/amdgpu: Add a new ioctl for listing client contexts
  drm/amdgpu: Add context handle renaming operation
  drm/amdgpu: Add driver managed buffer copy

 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 132 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 120 +++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.h |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  54 ++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  26 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |   4 +
 include/uapi/drm/amdgpu_drm.h           |  54 +++++++++-
 8 files changed, 365 insertions(+), 31 deletions(-)

-- 
2.54.0
