I am sending this early, mainly to get a feel for how it will be received.

This cover letter is written mostly for people who already know what checkpoint
and restore is, so I will just restate the high level goal: to be able to
checkpoint and restore a purely rendernode process (so no kfd in the picture).

For people not too familiar with what it is, it is probably best to first read
about it here: https://criu.org/Main_Page.

There are three pieces of work here:

1.
This kernel series which adds new uapi to amdgpu.

2.
A new IGT test case which helped me find what doesn't work and to verify what I
added.

https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/log/?h=amd-criu

The IGT starts out adding some very basic tests (first commit) which are laid
out in order of increasing complexity.

The following commits then add more tests and by the end, with all these series
combined, they all pass apart from the forking subtest, which I am leaving out
of scope for now.

(And to be clear, before this work, none of the test cases could pass.)

3.
Changes to the amdgpu CRIU plugin which use the above uapi, among other changes.

https://github.com/tursulin/criu/pull/new/amdgpu-render-node-rfc

As a picture is worth a thousand words, the best I can do is some terminal
pastes showing it all in action.

Terminal 1:

$ sudo ~/build-holo/tests/amdgpu/amd_criu --r busy-client-content
IGT-Version: 2.3-ge37a85b91 (x86_64) (Linux: 7.1.0-rc2-cfs x86_64)
Using IGT_SRANDOM=1779805687 for randomisation
Opened device: /dev/dri/renderD128
Starting subtest: busy-client-content
Start checkpointing within 10 seconds...

Now switch to terminal 2:

$ sudo /usr/local/sbin/criu dump -t `pgrep amd_criu | head -1` \
  -L /usr/local/lib/criu/ -vvv -o criu.log -j --link-remap --tcp-established \
  --file-locks --ext-unix-sk

Back to terminal 1:

...
Killed # This is normal - CRIU dump has saved and terminated the process

Back to terminal 2, let's restore it:

$ sudo /usr/local/sbin/criu restore -L /usr/local/lib/criu/ -vvv \
  -o restore.log --shell-job --link-remap --tcp-established --file-locks \
  --ext-unix-sk
Subtest busy-client-content: SUCCESS (10.739s)

And that is it. A client which was busy looping, submitting an SDMA_NOP IB, was
successfully checkpointed and restored to completion. It ran for the remainder
of the intended duration and we checked that the buffer content was as expected
at the end.
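
For anyone wanting to repeat the above, the two CRIU invocations could be
wrapped roughly as below. This is only a sketch: it reuses the paths and flags
from the pastes, while the function names and the DRY_RUN knob are made up for
illustration and are not part of CRIU or of this series.

```shell
#!/bin/sh
# Sketch only: wraps the dump and restore invocations shown above.
# CRIU and PLUGIN_DIR default to the paths used in this letter.
# DRY_RUN is a hypothetical convenience knob, not a CRIU feature.
CRIU=${CRIU:-/usr/local/sbin/criu}
PLUGIN_DIR=${PLUGIN_DIR:-/usr/local/lib/criu/}
# Flags shared by both invocations (-j in the dump paste is --shell-job).
COMMON="--shell-job --link-remap --tcp-established --file-locks --ext-unix-sk"

run() {
	# With DRY_RUN set, print the command instead of executing it.
	if [ -n "$DRY_RUN" ]; then
		echo "$@"
	else
		sudo "$@"
	fi
}

dump() {
	# $1 is the PID of the process to checkpoint.
	run "$CRIU" dump -t "$1" -L "$PLUGIN_DIR" -vvv -o criu.log $COMMON
}

restore() {
	# Restores from the images in the current directory, as in the paste.
	run "$CRIU" restore -L "$PLUGIN_DIR" -vvv -o restore.log $COMMON
}
```

After sourcing the file, dump "$(pgrep amd_criu | head -1)" followed by restore
should be equivalent to the manual steps in terminal 2.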

There is definitely more to do: play more with exported buffers, syncobjs,
fences and buffer object lists. But for now, as said, I am looking for some
early feedback.

Tvrtko Ursulin (5):
  drm/amdgpu: Extend listing of buffer handles with the userptr object
    flag
  drm/amdgpu: Add a reserved VM ID query
  drm/amdgpu: Add a new ioctl for listing client contexts
  drm/amdgpu: Add context handle renaming operation
  drm/amdgpu: Add driver managed buffer copy

 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 132 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 120 +++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.h |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  54 ++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  26 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |   4 +
 include/uapi/drm/amdgpu_drm.h           |  54 +++++++++-
 8 files changed, 365 insertions(+), 31 deletions(-)

-- 
2.54.0
