Re: [PATCH] Documentation: add initial documenation for user queues

Rodrigo Siqueira Mon, 22 Sep 2025 11:52:55 -0700

On 09/15, Alex Deucher wrote:
> Add an initial documentation page for user mode queues.
> 
> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> ---
>  Documentation/gpu/amdgpu/index.rst |   1 +
>  Documentation/gpu/amdgpu/userq.rst | 203 +++++++++++++++++++++++++++++
>  2 files changed, 204 insertions(+)
>  create mode 100644 Documentation/gpu/amdgpu/userq.rst
> 
> diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
> index bb2894b5edaf2..45523e9860fc5 100644
> --- a/Documentation/gpu/amdgpu/index.rst
> +++ b/Documentation/gpu/amdgpu/index.rst
> @@ -12,6 +12,7 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
>     module-parameters
>     gc/index
>     display/index
> +   userq
>     flashing
>     xgmi
>     ras
> diff --git a/Documentation/gpu/amdgpu/userq.rst b/Documentation/gpu/amdgpu/userq.rst
> new file mode 100644
> index 0000000000000..ca3ea71f7888b
> --- /dev/null
> +++ b/Documentation/gpu/amdgpu/userq.rst
> @@ -0,0 +1,203 @@
> +==================
> + User Mode Queues
> +==================
> +
> +Introduction
> +============
> +
> +Similar to the KFD, GPU engine queues move into userspace.  The idea is to let
> +user processes manage their submissions to the GPU engines directly, bypassing
> +IOCTL calls to the driver to submit work.  This reduces overhead and also allows
> +the GPU to submit work to itself.  Applications can set up work graphs of jobs
> +across multiple GPU engines without needing trips through the CPU.
> +
> +UMDs directly interface with firmware via per application shared memory areas.
> +The main vehicle for this is the queue.  A queue is a ring buffer with a read
> +pointer (rptr) and a write pointer (wptr).  The UMD writes IP specific packets
> +into the queue and the firmware processes those packets, kicking off work on the
> +GPU engines.  The CPU in the application (or another queue or device) updates
> +the wptr to tell the firmware how far into the ring buffer to process packets
> +and the rptr provides feedback to the UMD on how far the firmware has progressed
> +in executing those packets.  When the wptr and the rptr are equal, the queue is
> +idle.
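
The rptr/wptr relationship described above can be illustrated with a small toy model. This is only a sketch, not the driver's or firmware's implementation: the `RingQueue` class, its slot-based sizing, and the wrap-around pointer handling are assumptions made for illustration (real engines may use monotonically increasing pointers and dword units).

```python
class RingQueue:
    """Toy model of a user queue: a ring buffer with an rptr and a wptr."""

    def __init__(self, size):
        self.size = size            # number of packet slots (hypothetical unit)
        self.ring = [None] * size
        self.rptr = 0               # advanced by the "firmware" as it consumes
        self.wptr = 0               # advanced by the "UMD" as it produces

    def is_idle(self):
        # The queue is idle when the read and write pointers are equal.
        return self.rptr == self.wptr

    def free_slots(self):
        # One slot stays empty so a full ring is distinguishable from idle.
        return (self.rptr - self.wptr - 1) % self.size

    def submit(self, packets):
        """UMD side: write packets, then publish the new wptr."""
        if len(packets) > self.free_slots():
            raise BufferError("ring full")
        for packet in packets:
            self.ring[self.wptr] = packet
            self.wptr = (self.wptr + 1) % self.size

    def process(self):
        """Firmware side: consume packets until rptr catches up with wptr."""
        done = []
        while self.rptr != self.wptr:
            done.append(self.ring[self.rptr])
            self.rptr = (self.rptr + 1) % self.size
        return done
```

In this model the UMD only ever moves `wptr` and the firmware only ever moves `rptr`, which is what lets the two sides share the ring without a lock.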
> +
> +Theory of Operation
> +===================
> +
> +The various engines on modern AMD GPUs support multiple queues per engine with a
> +scheduling firmware which handles dynamically scheduling user queues on the
> +available hardware queue slots.  When the number of user queues outnumbers the
> +available hardware queue slots, the scheduling firmware dynamically maps and
> +unmaps queues based on priority and time quanta.  The state of each user queue
> +is managed in the kernel driver in an MQD (Memory Queue Descriptor).  This is a
> +buffer in GPU accessible memory that stores the state of a user queue.  The
> +scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
> +Queue Descriptor) when a user queue is mapped.  Each user queue requires a
> +number of additional buffers which represent the ring buffer and any metadata
> +needed by the engine for runtime operation.  On most engines this consists of
> +the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
> +to userspace), a wptr buffer (where the application will write the wptr for the
> +firmware to fetch it), and a doorbell.  A doorbell is a piece of one of the
> +device's MMIO BARs which can be mapped to specific user queues.  When the
> +application writes to the doorbell, it will signal the firmware to take some
> +action.  Writing to the doorbell wakes the firmware and causes it to fetch the
> +wptr and start processing the packets in the queue.  Each 4K page of the doorbell
> +BAR supports specific offset ranges for specific engines.  The doorbell of a
> +queue must be mapped into the aperture aligned to the IP used by the queue
> +(e.g., GFX, VCN, SDMA, etc.).  These doorbell apertures are set up via NBIO
> +registers.  Doorbells are 32 bit or 64 bit (depending on the engine) chunks of
> +the doorbell BAR.  A 4K doorbell page provides 512 64-bit doorbells for up to
> +512 user queues.  A subset of each page is reserved for each IP type supported
> +on the device.  The user can query the doorbell ranges for each IP via the INFO
> +IOCTL.  See the IOCTL Interfaces section for more information.
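
The "512 64-bit doorbells per 4K page" figure above follows from simple arithmetic, sketched below. The helper names are hypothetical, and the index-to-offset mapping assumes doorbells are packed back to back within the page, which the text does not state explicitly.

```python
PAGE_SIZE = 4096  # bytes in one 4K doorbell page


def doorbells_per_page(doorbell_bits=64):
    """Number of doorbells of the given width that fit in one page."""
    return PAGE_SIZE // (doorbell_bits // 8)


def doorbell_byte_offset(index, doorbell_bits=64):
    """Byte offset of a doorbell index within its page (assumed packed layout)."""
    if not 0 <= index < doorbells_per_page(doorbell_bits):
        raise ValueError("doorbell index outside the page")
    return index * (doorbell_bits // 8)
```

With 64-bit doorbells this gives 4096 / 8 = 512 slots per page, matching the number quoted in the patch.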
> +
> +When an application wants to create a user queue, it allocates the necessary
> +buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
> +These can be separate buffers or all part of one larger buffer.  The application
> +would map the buffer(s) into its GPUVM and use the GPU virtual addresses for
> +the areas of memory it wants to use for the user queue.  It would also
> +allocate a doorbell page for the doorbells used by the user queues.  The
> +application would then populate the MQD in the USERQ IOCTL structure with the
> +GPU virtual addresses and doorbell index it wants to use.  The user can also
> +specify the attributes for the user queue (priority, whether the queue is secure
> +for protected content, etc.).  The application would then call the USERQ
> +CREATE IOCTL to create the queue using the specified MQD details in the IOCTL.
> +The kernel driver then validates the MQD provided by the application and
> +translates the MQD into the engine specific MQD format for the IP.  The IP
> +specific MQD would be allocated and the queue would be added to the run list
> +maintained by the scheduling firmware.  Once the queue has been created, the
> +application can write packets directly into the queue, update the wptr, and
> +write to the doorbell offset to kick off work in the user queue.
> +
> +When the application is done with the user queue, it would call the USERQ
> +FREE IOCTL to destroy it.  The kernel driver would preempt the queue and
> +remove it from the scheduling firmware's run list.  Then the IP specific MQD
> +would be freed and the user queue state would be cleaned up.
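
The create/free lifecycle described in the two paragraphs above could be modelled roughly as follows. Everything here is a hypothetical stand-in: `UserMqd`, `ToyDriver`, the field names, and the use of `ValueError` are illustrative assumptions and do not reflect the real USERQ IOCTL structures or error codes.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class UserMqd:
    """Hypothetical MQD-like description an application would fill in."""
    ip_type: str
    ring_va: int
    rptr_va: int
    wptr_va: int
    doorbell_index: int
    priority: int = 0


class ToyDriver:
    """Toy stand-in for the kernel side of queue creation and teardown."""

    def __init__(self, mapped_vas, doorbell_ranges):
        self.mapped_vas = set(mapped_vas)        # GPU VAs mapped in the GPUVM
        self.doorbell_ranges = doorbell_ranges   # ip_type -> range of indices
        self.run_list = {}                       # queue id -> validated MQD
        self._ids = count(1)

    def create(self, mqd):
        # Validate the doorbell index against the range for this IP type.
        allowed = self.doorbell_ranges.get(mqd.ip_type)
        if allowed is None or mqd.doorbell_index not in allowed:
            raise ValueError("invalid doorbell index")
        # Validate that every address refers to memory the process has mapped.
        for va in (mqd.ring_va, mqd.rptr_va, mqd.wptr_va):
            if va not in self.mapped_vas:
                raise ValueError("invalid GPU virtual address")
        qid = next(self._ids)
        self.run_list[qid] = mqd                 # "add to the run list"
        return qid

    def free(self, qid):
        # Remove from the run list; the real driver would preempt first.
        del self.run_list[qid]
```

The point of the sketch is the ordering: validation happens before anything is added to the run list, so a rejected request leaves no state behind.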
> +
> +Some engines may require the aggregated doorbell too if the engine does not
> +support doorbells from unmapped queues.  The aggregated doorbell is a special
> +page of doorbell space which wakes the scheduler.  In cases where the engine may
> +be oversubscribed, some queues may not be mapped.  If the doorbell is rung when
> +the queue is not mapped, the engine firmware may miss the request.  Some
> +scheduling firmware may work around this by polling wptr shadows when the
> +hardware is oversubscribed; other engines may support doorbell updates from
> +unmapped queues.  In the event that one of these options is not available, the
> +kernel driver will map a page of aggregated doorbell space into each GPUVM
> +space.  The UMD will then update the doorbell and wptr as normal and then write
> +to the aggregated doorbell as well.
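
The resulting submission order on the UMD side can be written down as a short sketch. The step names are labels invented for illustration, and the relative order of the wptr and doorbell writes is an assumption based on the "fetch the wptr" behavior described earlier; the value written to the aggregated doorbell is not specified in the patch and is not modelled.

```python
def submit_steps(needs_aggregated_doorbell):
    """Ordered steps a UMD would take to kick off work (illustrative labels)."""
    steps = ["write_packets", "update_wptr", "ring_queue_doorbell"]
    if needs_aggregated_doorbell:
        # Extra write that wakes the scheduler when the queue may be unmapped.
        steps.append("ring_aggregated_doorbell")
    return steps
```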
> +
> +Special Packets
> +---------------
> +
> +In order to support legacy implicit synchronization, as well as mixed user and
> +kernel queues, we need a synchronization mechanism that is secure.  Because
> +kernel queues or memory management tasks depend on kernel fences, we need a way
> +for user queues to update memory that the kernel can use for a fence, that can't
> +be messed with by a bad actor.  To support this, we've added a protected fence
> +packet.  This packet works by writing a monotonically increasing value to
> +a memory location that only privileged clients have write access to.  User
> +queues only have read access.  When this packet is executed, the memory location
> +is updated and other queues (kernel or user) can see the results.  The
> +user application would submit this packet in their command stream.  The actual
> +packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the
> +behavior is the same.  The packet submission is handled in userspace.  The
> +kernel driver sets up the privileged memory used for each user queue when the
> +application creates the queue.
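
The protected fence semantics above (monotonically increasing value, privileged write, unprivileged read) can be modelled in a few lines. `ProtectedFence` and its `privileged` flag are hypothetical; in hardware the access restriction comes from memory permissions, not from a function argument.

```python
class ProtectedFence:
    """Toy model of a fence location only privileged writers may update."""

    def __init__(self):
        self._value = 0

    def read(self):
        # Any queue, kernel or user, may read the current fence value.
        return self._value

    def write(self, seq, privileged):
        # Stands in for execution of the protected fence packet.
        if not privileged:
            raise PermissionError("user queues only have read access")
        if seq <= self._value:
            raise ValueError("fence value must increase monotonically")
        self._value = seq

    def is_signaled(self, seq):
        # A waiter for 'seq' is satisfied once the value has reached it.
        return self._value >= seq
```

Because the value only moves forward, a waiter that checks `is_signaled(n)` never sees a fence go from signaled back to unsignaled.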
> +
> +
> +Memory Management
> +=================
> +
> +It is assumed that all buffers mapped into the GPUVM space for the process are
> +valid when engines on the GPU are running.  The kernel driver will only allow
> +user queues to run when all buffers are mapped.  If there is a memory event that
> +requires buffer migration, the kernel driver will preempt the user queues,
> +migrate buffers to where they need to be, update the GPUVM page tables and
> +invalidate the TLB, and then resume the user queues.
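
The ordering of that sequence is the important part, so here is a sketch that only records the steps. The function and step labels are made up for illustration and say nothing about how the driver actually implements each step.

```python
def memory_event_steps(queue_ids, buffers):
    """Ordered steps for a memory event that requires buffer migration."""
    steps = [("preempt", q) for q in queue_ids]        # stop user queues first
    steps += [("migrate", b) for b in buffers]         # then move the buffers
    steps += [("update_page_tables", None),            # then fix the mappings
              ("invalidate_tlb", None)]
    steps += [("resume", q) for q in queue_ids]        # only then restart
    return steps
```

Queues are resumed last so that no engine can touch a buffer while its mapping is stale.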
> +
> +Interaction with Kernel Queues
> +==============================
> +
> +Depending on the IP and the scheduling firmware, you can enable kernel queues
> +and user queues at the same time; however, you are limited by the HQD slots.
> +Kernel queues are always mapped so any work that goes into kernel queues will
> +take priority.  This limits the available HQD slots for user queues.
> +
> +Not all IPs will support user queues on all GPUs.  As such, UMDs will need to
> +support both user queues and kernel queues depending on the IP.  For example, a
> +GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
> +and VPE.  The kernel driver provides a way to determine if user queues and
> +kernel queues are supported on a per IP basis.  UMDs can query this information
> +via the INFO IOCTL and determine whether to use kernel queues or user queues
> +for each IP.
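
A UMD's per-IP decision could look something like the sketch below. The shape of the `support` dictionary is an assumption standing in for whatever the INFO query actually returns, and preferring user queues when both are available is just one possible policy.

```python
def choose_queue_types(support):
    """Pick a queue type per IP from a hypothetical INFO query result.

    'support' maps an IP name to the set of queue types it supports,
    e.g. {"gfx": {"user", "kernel"}, "vcn": {"kernel"}}.
    """
    choice = {}
    for ip, kinds in support.items():
        if "user" in kinds:
            choice[ip] = "user"        # policy assumption: prefer user queues
        elif "kernel" in kinds:
            choice[ip] = "kernel"
        else:
            raise ValueError(f"no queue type available for {ip}")
    return choice
```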
> +
> +Queue Resets
> +============
> +
> +For most engines, queues can be reset individually; this includes GFX, compute,
> +and SDMA queues.  When a hung queue is detected, it can be reset either via the
> +scheduling firmware or MMIO.  Since there are no kernel fences for most user
> +queues, hangs will usually only be detected when some other event happens; e.g.,
> +a memory event which requires migration of buffers.  When the queues are
> +preempted, if a queue is hung, the preemption will fail.  The driver will then
> +look up the queues that failed to preempt, reset them, and record which queues
> +are hung.
> +
> +On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
> +status.  The UMD will provide the queue id in the IOCTL and the kernel driver
> +will check if it has already recorded the queue as hung (e.g., due to failed
> +preemption) and report back the status.
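
The detect-on-preempt flow and the later status query can be sketched as follows. `ToyQueue`, `preempt_all`, and `was_hung` are hypothetical names, and the boolean return of the query is an assumption; the patch marks QUERY_STATUS as work in progress and does not define its result format.

```python
class ToyQueue:
    """Toy queue whose preemption fails while it is hung."""

    def __init__(self, qid, hung=False):
        self.qid = qid
        self.hung = hung

    def preempt(self):
        # In this model a hung queue never completes preemption.
        return not self.hung

    def reset(self):
        self.hung = False


def preempt_all(queues, hung_record):
    """Preempt every queue; reset and record the ones that fail."""
    for q in queues:
        if not q.preempt():
            q.reset()
            hung_record.add(q.qid)


def was_hung(qid, hung_record):
    """What a status query would report for a given queue id."""
    return qid in hung_record
```

The record outlives the reset, which is what allows a UMD to learn after the fact that its queue was hung and reset.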
> +
> +IOCTL Interfaces
> +================
> +
> +GPU virtual addresses used for queues and related data (rptrs, wptrs, context
> +save areas, etc.) should be validated by the kernel mode driver to prevent the
> +user from specifying invalid GPU virtual addresses.  If the user provides
> +invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
> +error.  These buffers should also be tracked in the kernel driver so
> +that if the user attempts to unmap the buffer(s) from the GPUVM, the unmap call
> +would return an error.
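
The tracking requirement in the last sentence could be modelled like this. `BufferTracker` and the exception types are illustrative assumptions; the patch does not say which error code the unmap call returns.

```python
class BufferTracker:
    """Toy model: refuse to unmap GPU VAs that a user queue still uses."""

    def __init__(self):
        self.mapped = set()      # GPU VAs currently mapped in the GPUVM
        self.in_use = {}         # queue id -> set of GPU VAs it depends on

    def map(self, va):
        self.mapped.add(va)

    def attach(self, qid, vas):
        # Queue creation: every address must already be mapped.
        missing = [va for va in vas if va not in self.mapped]
        if missing:
            raise ValueError("invalid GPU virtual address")
        self.in_use[qid] = set(vas)

    def detach(self, qid):
        # Queue destruction releases the queue's hold on its buffers.
        self.in_use.pop(qid, None)

    def unmap(self, va):
        if any(va in vas for vas in self.in_use.values()):
            raise PermissionError("buffer is in use by a user queue")
        self.mapped.discard(va)
```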
> +
> +INFO
> +----
> +There are several new INFO queries related to user queues in order to query the
> +size of user queue metadata needed for a user queue (e.g., context save areas
> +or shadow buffers), whether kernel or user queues or both are supported
> +for each IP type, and the offsets for each IP type in each doorbell page.
> +
> +USERQ
> +-----
> +The USERQ IOCTL is used for creating, freeing, and querying the status of user
> +queues.  It supports 3 opcodes:
> +
> +1. CREATE - Create a user queue.  The application provides an MQD-like structure
> +   that defines the type of queue and associated metadata and flags for that
> +   queue type.  Returns the queue id.
> +2. FREE - Free a user queue.
> +3. QUERY_STATUS - Query the status of a queue.  Used to check if the queue is
> +   healthy or not.  E.g., if the queue has been reset. (WIP)
> +
> +USERQ_SIGNAL
> +------------
> +The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.
> +
> +USERQ_WAIT
> +----------
> +The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.
> +
> +Kernel and User Queues
> +======================
> +
> +In order to properly validate and test performance, we have a driver option to
> +select what type of queues are enabled (kernel queues, user queues or both).
> +The user_queue driver parameter allows you to enable kernel queues only (0),
> +user queues and kernel queues (1), and user queues only (2).  Enabling user
> +queues only will free up static queue assignments that would otherwise be used
> +by kernel queues for use by the scheduling firmware.  Some kernel queues are
> +required for kernel driver operation and they will always be created.  When the
> +kernel queues are not enabled, they are not registered with the drm scheduler
> +and the CS IOCTL will reject any incoming command submissions which target those
> +queue types.  Kernel queues only (0) mirrors the behavior on all existing GPUs.
> +Enabling both queues allows for backwards compatibility with old userspace while
> +still supporting user queues.
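
The three user_queue values listed above map onto enabled queue types as sketched here. The enum and helper names are hypothetical; only the numeric values 0, 1, and 2 and their meanings come from the patch text.

```python
from enum import IntEnum


class UserQueueMode(IntEnum):
    KERNEL_ONLY = 0       # kernel queues only
    BOTH = 1              # user queues and kernel queues
    USER_ONLY = 2         # user queues only


def enabled_queue_types(mode):
    """Queue types exposed to userspace for a given user_queue value."""
    mode = UserQueueMode(mode)
    kinds = set()
    if mode in (UserQueueMode.KERNEL_ONLY, UserQueueMode.BOTH):
        kinds.add("kernel")
    if mode in (UserQueueMode.BOTH, UserQueueMode.USER_ONLY):
        kinds.add("user")
    return kinds


def cs_ioctl_accepts(mode):
    """The CS IOCTL rejects submissions when kernel queues are not enabled."""
    return "kernel" in enabled_queue_types(mode)
```

This only models what userspace can target; per the patch, some kernel queues needed by the driver itself are created regardless of the setting.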
> -- 
> 2.51.0
>


Reviewed-by: Rodrigo Siqueira <sique...@igalia.com> 

-- 
Rodrigo Siqueira
