Re: [PATCH 0/8] AMDGPU usermode queues

2023-02-06 Thread Alex Deucher
On Mon, Feb 6, 2023 at 10:39 AM Michel Dänzer wrote:
>
> On 2/3/23 22:54, Shashank Sharma wrote:
> > From: Shashank Sharma 
> >
> > This patch series introduces AMDGPU usermode graphics queues.
> > User queues are a method of GPU workload submission into the graphics
> > hardware without any interaction with kernel/DRM schedulers. In this
> > method, a userspace graphics application can create its own workqueue
> > and submit work directly to the GPU HW.
> >
> > The general idea of how this is supposed to work:
> > - The application creates the following GPU objects:
> >   - A queue object to hold the workload packets.
> >   - A read pointer object.
> >   - A write pointer object.
> >   - A doorbell page.
> > - The kernel picks a 32-bit offset in the doorbell page for this queue.
> > - The application uses the usermode_queue_create IOCTL introduced in
> >   this patch, passing the GPU addresses of these objects (read
> >   ptr, write ptr, queue base address and doorbell address).
> > - The kernel creates the queue and maps it in the HW.
> > - The application can start submitting data to the queue as soon as
> >   the kernel IOCTL returns.
> > - Once the data is filled in the queue, the app must write the number of
> >   dwords to the doorbell offset, and the GPU will start fetching the data.
> >
> > libDRM changes for this series and a sample DRM test program can be found
> > in the MESA merge request here:
> > https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/287
>
> I hope everyone's clear these libdrm_amdgpu changes won't be sufficient uAPI 
> validation to allow the kernel bits to be merged upstream.

Right, this is just what we have been using to bring up the feature so far.

Alex

>
> This will require an implementation in the Mesa radeonsi / RADV driver, 
> ideally with working implicit synchronization for BOs shared via dma-buf.
>
>
> --
> Earthling Michel Dänzer   | https://redhat.com
> Libre software enthusiast | Mesa and Xwayland developer
>


Re: [PATCH 0/8] AMDGPU usermode queues

2023-02-06 Thread Michel Dänzer
On 2/3/23 22:54, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> This patch series introduces AMDGPU usermode graphics queues.
> User queues are a method of GPU workload submission into the graphics
> hardware without any interaction with kernel/DRM schedulers. In this
> method, a userspace graphics application can create its own workqueue
> and submit work directly to the GPU HW.
> 
> The general idea of how this is supposed to work:
> - The application creates the following GPU objects:
>   - A queue object to hold the workload packets.
>   - A read pointer object.
>   - A write pointer object.
>   - A doorbell page.
> - The kernel picks a 32-bit offset in the doorbell page for this queue.
> - The application uses the usermode_queue_create IOCTL introduced in
>   this patch, passing the GPU addresses of these objects (read
>   ptr, write ptr, queue base address and doorbell address).
> - The kernel creates the queue and maps it in the HW.
> - The application can start submitting data to the queue as soon as
>   the kernel IOCTL returns.
> - Once the data is filled in the queue, the app must write the number of
>   dwords to the doorbell offset, and the GPU will start fetching the data.
> 
> libDRM changes for this series and a sample DRM test program can be found
> in the MESA merge request here:
> https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/287

I hope everyone's clear these libdrm_amdgpu changes won't be sufficient uAPI 
validation to allow the kernel bits to be merged upstream.

This will require an implementation in the Mesa radeonsi / RADV driver, ideally 
with working implicit synchronization for BOs shared via dma-buf.


-- 
Earthling Michel Dänzer   | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer



Re: [PATCH 0/8] AMDGPU usermode queues

2023-02-06 Thread Christian König

On 06.02.23 at 01:52, Dave Airlie wrote:

On Sat, 4 Feb 2023 at 07:54, Shashank Sharma  wrote:

From: Shashank Sharma 

This patch series introduces AMDGPU usermode graphics queues.
User queues are a method of GPU workload submission into the graphics
hardware without any interaction with kernel/DRM schedulers. In this
method, a userspace graphics application can create its own workqueue
and submit work directly to the GPU HW.

The general idea of how this is supposed to work:
- The application creates the following GPU objects:
   - A queue object to hold the workload packets.
   - A read pointer object.
   - A write pointer object.
   - A doorbell page.
- The kernel picks a 32-bit offset in the doorbell page for this queue.
- The application uses the usermode_queue_create IOCTL introduced in
   this patch, passing the GPU addresses of these objects (read
   ptr, write ptr, queue base address and doorbell address).
- The kernel creates the queue and maps it in the HW.
- The application can start submitting data to the queue as soon as
   the kernel IOCTL returns.
- Once the data is filled in the queue, the app must write the number of
   dwords to the doorbell offset, and the GPU will start fetching the data.

So I just have one question about forward progress here; let's call it
the 51% of VRAM problem.

You have two apps, and both have working sets that allocate > 51% of VRAM.


Marek and I have been working on this quite extensively.


Application (a) has the VRAM and mapping for the user queues and is
submitting work.
Application (b) wants to submit work, but it has no queue mapping as it
was previously evicted. Does (b) have to call an ioctl to get its
mapping back?


Long story short: No, but that's a bit more complicated to explain.


When (b) calls the ioctl, (a) loses its mapping. Control returns to (b),
but before it submits any work on the ring mapping it has, (a) gets
control and notices it has no queues, so it calls the ioctl, and (b)
loses its mapping, and around and around they go, never making forward
progress.

What's the exit strategy for something like that? Fall back to kernel
submission so you can get memory objects validated and submit some work?


First of all, the fw makes sure that processes can only be evicted after
they have used up their time slice. So when you have two processes fighting
over a shared resource (memory, locks or whatever), they will
always run until the end of their time slice before they are pushed away
from the hw.


Then when a process is evicted we take a look at what the process has 
already scheduled as work on the hw. If the process isn't idle we start 
a delayed work item to get it going again (similar to what the KFD is 
doing at the moment). When the process is idle we unmap the doorbell 
page(s) from the CPU and wait for the page fault which signals that the 
process wants to submit something again.
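
A rough sketch of that decision in kernel-style C is below. The type and the
helpers are invented for illustration and don't exist in amdgpu/KFD; only
schedule_delayed_work()/msecs_to_jiffies() are real kernel APIs, and the real
logic would live in the driver's eviction handling.

/*
 * Sketch only: example_uq_process, example_has_pending_work() and
 * example_unmap_doorbell() are made-up names, not existing driver code.
 */
#include <linux/jiffies.h>
#include <linux/workqueue.h>

struct example_uq_process {
    struct delayed_work restore_work;  /* remaps the queues onto the HW */
    /* ... queue list, doorbell BO, etc. ... */
};

static bool example_has_pending_work(struct example_uq_process *p)
{
    return false;  /* stub: real code would inspect the queue/HW state */
}

static void example_unmap_doorbell(struct example_uq_process *p)
{
    /* stub: real code would zap the CPU PTEs of the doorbell page(s) */
}

static void example_handle_eviction(struct example_uq_process *p)
{
    if (example_has_pending_work(p)) {
        /* Not idle: a delayed work item gets the process going again,
         * similar to what the KFD restore worker does today. */
        schedule_delayed_work(&p->restore_work, msecs_to_jiffies(100));
    } else {
        /* Idle: unmap the doorbell page(s) from the CPU.  The next
         * doorbell write faults, and that page fault is the signal
         * that the process wants to submit again. */
        example_unmap_doorbell(p);
    }
}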


And the last component is a static resource management scheme which distributes
the available resources equally between the different active processes
fighting over them. Activity of a process is determined by the periodic
interrupts sent by the hw for running processes.
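
For VRAM, a minimal sketch of such an equal split could look like the
following; the function and parameter names are invented for illustration
and are not existing TTM/amdgpu code.

#include <linux/math64.h>
#include <linux/types.h>

/*
 * Sketch only: distribute the VRAM that isn't reserved for the kernel
 * equally among the processes that the HW's periodic interrupts have
 * marked as active.
 */
static u64 example_vram_limit_per_process(u64 total_vram, u64 kernel_reserved,
                                          unsigned int active_processes)
{
    u64 available = total_vram - kernel_reserved;

    if (!active_processes)
        return available;

    /* Equal share: with two equally active processes, each one is
     * allowed to use at most half of the available VRAM. */
    return div64_u64(available, active_processes);
}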


I call the memory management algorithm based on this Robin Hood
(https://drive.google.com/file/d/1vIrX37c3B2IgWFtZ2UpeKxh0-YMlV6NU/view)
and have simulated it a bit in some spreadsheets, but it isn't fully
implemented yet. I've been working on this for a couple of years now,
slowly pushing DRM/TTM in the direction we need for this to work.


Christian.



Dave.




Re: [PATCH 0/8] AMDGPU usermode queues

2023-02-05 Thread Dave Airlie
On Sat, 4 Feb 2023 at 07:54, Shashank Sharma  wrote:
>
> From: Shashank Sharma 
>
> This patch series introduces AMDGPU usermode graphics queues.
> User queues are a method of GPU workload submission into the graphics
> hardware without any interaction with kernel/DRM schedulers. In this
> method, a userspace graphics application can create its own workqueue
> and submit work directly to the GPU HW.
>
> The general idea of how this is supposed to work:
> - The application creates the following GPU objects:
>   - A queue object to hold the workload packets.
>   - A read pointer object.
>   - A write pointer object.
>   - A doorbell page.
> - The kernel picks a 32-bit offset in the doorbell page for this queue.
> - The application uses the usermode_queue_create IOCTL introduced in
>   this patch, passing the GPU addresses of these objects (read
>   ptr, write ptr, queue base address and doorbell address).
> - The kernel creates the queue and maps it in the HW.
> - The application can start submitting data to the queue as soon as
>   the kernel IOCTL returns.
> - Once the data is filled in the queue, the app must write the number of
>   dwords to the doorbell offset, and the GPU will start fetching the data.

So I just have one question about forward progress here; let's call it
the 51% of VRAM problem.

You have two apps, and both have working sets that allocate > 51% of VRAM.

Application (a) has the VRAM and mapping for the user queues and is
submitting work.
Application (b) wants to submit work, but it has no queue mapping as it
was previously evicted. Does (b) have to call an ioctl to get its
mapping back?
When (b) calls the ioctl, (a) loses its mapping. Control returns to (b),
but before it submits any work on the ring mapping it has, (a) gets
control and notices it has no queues, so it calls the ioctl, and (b)
loses its mapping, and around and around they go, never making forward
progress.

What's the exit strategy for something like that? Fall back to kernel
submission so you can get memory objects validated and submit some work?

Dave.


[PATCH 0/8] AMDGPU usermode queues

2023-02-03 Thread Shashank Sharma
From: Shashank Sharma 

This patch series introduces AMDGPU usermode graphics queues.
User queues are a method of GPU workload submission into the graphics
hardware without any interaction with kernel/DRM schedulers. In this
method, a userspace graphics application can create its own workqueue
and submit work directly to the GPU HW.

The general idea of how this is supposed to work:
- The application creates the following GPU objects:
  - A queue object to hold the workload packets.
  - A read pointer object.
  - A write pointer object.
  - A doorbell page.
- The kernel picks a 32-bit offset in the doorbell page for this queue.
- The application uses the usermode_queue_create IOCTL introduced in
  this patch, passing the GPU addresses of these objects (read
  ptr, write ptr, queue base address and doorbell address).
- The kernel creates the queue and maps it in the HW.
- The application can start submitting data to the queue as soon as
  the kernel IOCTL returns.
- Once the data is filled in the queue, the app must write the number of
  dwords to the doorbell offset, and the GPU will start fetching the data.
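
To make the flow above concrete, here is a minimal userspace sketch of the
intended submission path. All names below (the struct layout,
EXAMPLE_IOCTL_USERQ_CREATE, the helpers) are placeholders invented for
illustration; the actual uAPI is defined in the "drm/amdgpu: UAPI for user
queue management" patch of this series.

/* Illustration only: placeholder names, not the actual amdgpu uAPI. */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define EXAMPLE_IOCTL_USERQ_CREATE 0  /* stands in for the real request number */

struct example_userq_create {
    uint64_t queue_va;        /* GPU VA of the queue (ring) buffer object */
    uint64_t rptr_va;         /* GPU VA of the read pointer object */
    uint64_t wptr_va;         /* GPU VA of the write pointer object */
    uint64_t doorbell_va;     /* GPU VA of the doorbell page */
    uint32_t doorbell_index;  /* returned: 32-bit slot picked by the kernel */
    uint32_t queue_id;        /* returned: handle for the created queue */
};

/* Step 1: ask the kernel to create the queue and map it to the HW. */
static int example_userq_create(int drm_fd, struct example_userq_create *args)
{
    return ioctl(drm_fd, EXAMPLE_IOCTL_USERQ_CREATE, args);
}

/*
 * Step 2: submission never enters the kernel.  Copy the packets into the
 * ring buffer, then write the number of dwords to the doorbell slot the
 * kernel picked; the GPU starts fetching from the queue on its own.
 */
static void example_userq_submit(uint32_t *ring,
                                 volatile uint32_t *doorbell_page,
                                 uint32_t doorbell_index,
                                 const uint32_t *pkts, uint32_t ndw)
{
    memcpy(ring, pkts, ndw * sizeof(uint32_t));  /* fill the queue object */
    doorbell_page[doorbell_index] = ndw;         /* ring the doorbell */
}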

libDRM changes for this series and a sample DRM test program can be found
in the MESA merge request here:
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/287

The RFC patch series and previous discussion can be seen here:
https://patchwork.freedesktop.org/series/112214/

This patch series needs the doorbell re-design changes, which are being
reviewed here:
https://patchwork.freedesktop.org/series/113669/

In the absence of the doorbell patches, this patch series uses a hack patch
to test the functionality. That hack patch is also published here at the
end of the series, just for reference.

Alex Deucher (1):
  drm/amdgpu: UAPI for user queue management

Arvind Yadav (1):
  drm/amdgpu: DO-NOT-MERGE add busy-waiting delay

Shashank Sharma (6):
  drm/amdgpu: add usermode queues
  drm/amdgpu: introduce userqueue MQD handlers
  drm/amdgpu: Add V11 graphics MQD functions
  drm/amdgpu: Create context for usermode queue
  drm/amdgpu: Map userqueue into HW
  drm/amdgpu: DO-NOT-MERGE doorbell hack

 drivers/gpu/drm/amd/amdgpu/Makefile   |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c   |   5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 365 ++
 .../amd/amdgpu/amdgpu_userqueue_mqd_gfx_v11.c | 300 ++
 .../gpu/drm/amd/include/amdgpu_userqueue.h|  93 +
 drivers/gpu/drm/amd/include/v11_structs.h |  16 +-
 include/uapi/drm/amdgpu_drm.h |  59 +++
 9 files changed, 837 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_mqd_gfx_v11.c
 create mode 100644 drivers/gpu/drm/amd/include/amdgpu_userqueue.h

-- 
2.34.1