On Wed, Feb 25, 2026 at 07:27:47PM +0530, Ekansh Gupta wrote:
> 
> 
> On 2/24/2026 2:47 AM, Dmitry Baryshkov wrote:
> > On Tue, Feb 24, 2026 at 12:38:55AM +0530, Ekansh Gupta wrote:
> >> Add initial documentation for the Qualcomm DSP Accelerator (QDA) driver
> >> integrated in the DRM accel subsystem.
> >>
> >> The new docs introduce QDA as a DRM/accel-based implementation of
> >> Hexagon DSP offload that is intended as a modern alternative to the
> >> legacy FastRPC driver in drivers/misc. The text describes the driver
> >> motivation, high-level architecture and interaction with IOMMU context
> >> banks, GEM-based buffer management and the RPMsg transport.
> >>
> >> The user-space facing section documents the main QDA IOCTLs used to
> >> establish DSP sessions, manage GEM buffer objects and invoke remote
> >> procedures using the FastRPC protocol, along with a typical lifecycle
> >> example for applications.
> >>
> >> Finally, the driver is wired into the Compute Accelerators
> >> documentation index under Documentation/accel, and a brief debugging
> >> section shows how to enable dynamic debug for the QDA implementation.
> >>
> >> Signed-off-by: Ekansh Gupta <[email protected]>
> >> ---
> >>  Documentation/accel/index.rst     |   1 +
> >>  Documentation/accel/qda/index.rst |  14 +++++
> >>  Documentation/accel/qda/qda.rst   | 129 ++++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 144 insertions(+)
> >>
> >> diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst
> >> index cbc7d4c3876a..5901ea7f784c 100644
> >> --- a/Documentation/accel/index.rst
> >> +++ b/Documentation/accel/index.rst
> >> @@ -10,4 +10,5 @@ Compute Accelerators
> >>     introduction
> >>     amdxdna/index
> >>     qaic/index
> >> +   qda/index
> >>     rocket/index
> >> diff --git a/Documentation/accel/qda/index.rst b/Documentation/accel/qda/index.rst
> >> new file mode 100644
> >> index 000000000000..bce188f21117
> >> --- /dev/null
> >> +++ b/Documentation/accel/qda/index.rst
> >> @@ -0,0 +1,14 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-only
> >> +
> >> +==============================
> >> + accel/qda Qualcomm DSP Driver
> >> +==============================
> >> +
> >> +The **accel/qda** driver provides support for Qualcomm Hexagon DSPs (Digital
> >> +Signal Processors) within the DRM accelerator framework. It serves as a modern
> >> +replacement for the legacy FastRPC driver, offering improved resource management
> >> +and standard subsystem integration.
> >> +
> >> +.. toctree::
> >> +
> >> +   qda
> >> diff --git a/Documentation/accel/qda/qda.rst b/Documentation/accel/qda/qda.rst
> >> new file mode 100644
> >> index 000000000000..742159841b95
> >> --- /dev/null
> >> +++ b/Documentation/accel/qda/qda.rst
> >> @@ -0,0 +1,129 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-only
> >> +
> >> +==================================
> >> +Qualcomm Hexagon DSP (QDA) Driver
> >> +==================================
> >> +
> >> +Introduction
> >> +============
> >> +
> >> +The **QDA** (Qualcomm DSP Accelerator) driver is a new DRM-based
> >> +accelerator driver for Qualcomm's Hexagon DSPs. It provides a standardized
> >> +interface for user-space applications to offload computational tasks ranging
> >> +from audio processing and sensor offload to computer vision and AI
> >> +inference to the Hexagon DSPs found on Qualcomm SoCs.
> >> +
> >> +This driver is designed to align with the Linux kernel's modern **Compute
> >> +Accelerators** subsystem (`drivers/accel/`), providing a robust and modular
> >> +alternative to the legacy FastRPC driver in `drivers/misc/`, offering
> >> +improved resource management and better integration with standard kernel
> >> +subsystems.
> >> +
> >> +Motivation
> >> +==========
> >> +
> >> +The existing FastRPC implementation in the kernel utilizes a custom character
> >> +device and lacks integration with modern kernel memory management frameworks.
> >> +The QDA driver addresses these limitations by:
> >> +
> >> +1.  **Adopting the DRM accel Framework**: Leveraging standard uAPIs for device
> >> +    management, job submission, and synchronization.
> >> +2.  **Utilizing GEM for Memory**: Providing proper buffer object management,
> >> +    including DMA-BUF import/export capabilities.
> >> +3.  **Improving Isolation**: Using IOMMU context banks to enforce memory
> >> +    isolation between different DSP user sessions.
> >> +
> >> +Key Features
> >> +============
> >> +
> >> +*   **Standard Accelerator Interface**: Exposes a standard character device
> >> +    node (e.g., `/dev/accel/accel0`) via the DRM subsystem.
> >> +*   **Unified Offload Support**: Supports all DSP domains (ADSP, CDSP, SDSP,
> >> +    GDSP) via a single driver architecture.
> >> +*   **FastRPC Protocol**: Implements the reliable Remote Procedure Call
> >> +    (FastRPC) protocol for communication between the application processor
> >> +    and DSP.
> >> +*   **DMA-BUF Interop**: Seamless sharing of memory buffers between the DSP
> >> +    and other multimedia subsystems (GPU, Camera, Video) via standard DMA-BUFs.
> >> +*   **Modular Design**: Clean separation between the core DRM logic, the memory
> >> +    manager, and the RPMsg-based transport layer.
> >> +
> >> +Architecture
> >> +============
> >> +
> >> +The QDA driver is composed of several modular components:
> >> +
> >> +1.  **Core Driver (`qda_drv`)**: Manages device registration, file operations,
> >> +    and bridges the driver with the DRM accelerator subsystem.
> >> +2.  **Memory Manager (`qda_memory_manager`)**: A flexible memory management
> >> +    layer that handles IOMMU context banks. It supports pluggable backends
> >> +    (such as DMA-coherent) to adapt to different SoC memory architectures.
> >> +3.  **GEM Subsystem**: Implements the DRM GEM interface for buffer management:
> >> +
> >> +    * **`qda_gem`**: Core GEM object management, including allocation, mmap
> >> +      operations, and buffer lifecycle management.
> >> +    * **`qda_prime`**: PRIME import functionality for DMA-BUF interoperability,
> >> +      enabling seamless buffer sharing with other kernel subsystems.
> >> +
> >> +4.  **Transport Layer (`qda_rpmsg`)**: Abstraction over the RPMsg framework
> >> +    to handle low-level message passing with the DSP firmware.
> >> +5.  **Compute Bus (`qda_compute_bus`)**: A custom virtual bus used to
> >> +    enumerate and manage the specific compute context banks defined in the
> >> +    device tree.
> > I'm really not sure if it's a bonus or not. I'm waiting for iommu-map
> > improvements to land to send patches reworking FastRPC CB from using
> > probe into being created by the main driver: it would remove some of the
> > possible race conditions between main driver finishing probe and the CB
> > devices probing in the background.
> >
> > What's the actual benefit of the CB bus?
> I tried following the Tegra host1x logic here, as was discussed in [1].
> My understanding is that this makes the CBs more manageable and reduces
> the scope of the races that exist in the current fastrpc driver.

It's nice, but then it can also be used by the existing fastrpc driver.
Would you mind splitting it to a separate changeset and submitting it?

> 
> That said, I'm not completely aware of the iommu-map improvements. Are
> they the ones being discussed for this patch [2]? If they help the main
> driver create CB devices directly, then I would be happy to adapt the
> design.

That would mostly mean a change to the way we describe CBs (using the
property instead of the in-tree subdevices). Anyway, as I wrote, please
submit it separately.

> 
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
> >
> >> +6.  **FastRPC Core (`qda_fastrpc`)**: Implements the protocol logic for
> >> +    marshalling arguments and handling remote invocations.
> >> +
> >> +User-Space API
> >> +==============
> >> +
> >> +The driver exposes a set of DRM-compliant IOCTLs. Note that these are designed
> >> +to be familiar to existing FastRPC users while adhering to DRM standards.
> >> +
> >> +*   `DRM_IOCTL_QDA_QUERY`: Query DSP type (e.g., "cdsp", "adsp")
> >> +    and capabilities.
> >> +*   `DRM_IOCTL_QDA_INIT_ATTACH`: Attach a user session to the DSP's protection
> >> +    domain.
> >> +*   `DRM_IOCTL_QDA_INIT_CREATE`: Initialize a new process context on the DSP.
> > You need to explain the difference between these two.
> Ack.
> >
> >> +*   `DRM_IOCTL_QDA_INVOKE`: Submit a remote method invocation (the primary
> >> +    execution unit).
> >> +*   `DRM_IOCTL_QDA_GEM_CREATE`: Allocate a GEM buffer object for DSP usage.
> >> +*   `DRM_IOCTL_QDA_GEM_MMAP_OFFSET`: Retrieve mmap offsets for memory mapping.
> >> +*   `DRM_IOCTL_QDA_MAP` / `DRM_IOCTL_QDA_MUNMAP`: Map or unmap buffers into the
> >> +    DSP's virtual address space.
> > Do we need to make this separate? Can we map/unmap buffers on their
> > usage? Or when they are created? I'm thinking about virtualization
> > here.
> The lib provides ways (fastrpc_mmap/remote_mmap64) for users to
> map/unmap the buffers on the DSP as the process requires. The ioctls are
> added to support the same.

If the buffers are mapped, then library calls become empty calls. Let's
focus on the API first and adapt to the library later on.

> > An alternative approach would be to merge
> > GET_MMAP_OFFSET with _MAP: once you map it to the DSP memory, you will
> > get the offset. 
> _MAP is not needed for all the buffers. Most of the remote call buffers
> that are passed to the DSP are automatically mapped by the DSP before it
> invokes the DSP implementation, so user space does not need to call _MAP
> for these.

Is there a reason for that? I'd really prefer if we change it, making it
more effective and more controllable. 

> 
> Some buffers (e.g., shared persistent buffers) do require explicit
> mapping, which is why MAP/MUNMAP exists in FastRPC.
> 
> Because of this behavioral difference, merging GET_MMAP_OFFSET with MAP
> is not accurate. GET_MMAP_OFFSET is for CPU-side mmap via GEM, whereas
> MAP is specifically for DSP virtual address assignment.
> >
> >> +
> >> +Usage Example
> >> +=============
> >> +
> >> +A typical lifecycle for a user-space application:
> >> +
> >> +1.  **Discovery**: Open `/dev/accel/accel*` and check
> >> +    `DRM_IOCTL_QDA_QUERY` to find the desired DSP (e.g., CDSP for
> >> +    compute workloads).
> >> +2.  **Initialization**: Call `DRM_IOCTL_QDA_INIT_ATTACH` and
> >> +    `DRM_IOCTL_QDA_INIT_CREATE` to establish a session.
> >> +3.  **Memory**: Allocate buffers via `DRM_IOCTL_QDA_GEM_CREATE` or import
> >> +    DMA-BUFs (PRIME fd) from other drivers using `DRM_IOCTL_PRIME_FD_TO_HANDLE`.
> >> +4.  **Execution**: Use `DRM_IOCTL_QDA_INVOKE` to pass arguments and execute
> >> +    functions on the DSP.
> >> +5.  **Cleanup**: Close file descriptors to automatically release resources and
> >> +    detach the session.
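
For the discovery step, a minimal user-space sketch might help readers.
It only enumerates candidate nodes under /dev/accel and leaves out the
QDA ioctls, since their argument layout is not shown in this patch. The
helper name list_accel_nodes is made up for illustration.

```python
import glob
import os
import re


def list_accel_nodes(dev_dir="/dev/accel"):
    """Return paths named accel<N> under dev_dir, ordered by numeric suffix."""
    pattern = re.compile(r"^accel(\d+)$")
    nodes = []
    for path in glob.glob(os.path.join(dev_dir, "accel*")):
        match = pattern.match(os.path.basename(path))
        if match:
            # Keep the index as an int so accel10 sorts after accel2.
            nodes.append((int(match.group(1)), path))
    return [path for _, path in sorted(nodes)]


if __name__ == "__main__":
    # Each node would then be opened and queried with the driver's
    # QUERY ioctl to find the desired DSP (not shown here).
    for node in list_accel_nodes():
        print(node)
```

Sorting by the numeric suffix rather than by string keeps the order
stable once more than ten accel devices are present.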
> >> +
> >> +Internal Implementation
> >> +=======================
> >> +
> >> +Memory Management
> >> +-----------------
> >> +The driver's memory manager creates virtual "IOMMU devices" that map to
> >> +hardware context banks. This allows the driver to manage multiple isolated
> >> +address spaces. The implementation currently uses a **DMA-coherent backend**
> >> +to ensure data consistency between the CPU and DSP without manual cache
> >> +maintenance in most cases.
> >> +
> >> +Debugging
> >> +=========
> >> +The driver includes extensive dynamic debug support. Enable it via the
> >> +kernel's dynamic debug control:
> >> +
> >> +.. code-block:: bash
> >> +
> >> +    echo "file drivers/accel/qda/* +p" > /sys/kernel/debug/dynamic_debug/control
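
The same toggle can be scripted from a test harness. This is a rough
sketch assuming debugfs is mounted at /sys/kernel/debug and the caller
is root; the function names dyndbg_query and apply_dyndbg are
illustrative only, not part of any existing tool.

```python
def dyndbg_query(file_glob, enable=True):
    """Build a dynamic debug query that sets or clears the 'p' flag."""
    flag = "+p" if enable else "-p"
    return f"file {file_glob} {flag}"


def apply_dyndbg(query, control="/sys/kernel/debug/dynamic_debug/control"):
    """Write a single query to the dynamic debug control file."""
    with open(control, "w") as ctl:
        ctl.write(query + "\n")


if __name__ == "__main__":
    # Equivalent to the echo command in the patch; needs root and debugfs.
    apply_dyndbg(dyndbg_query("drivers/accel/qda/*"))
```

Passing enable=False produces the matching "-p" query to turn the
messages off again after a test run.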
> > Please add documentation on how to build the test apps and how to load
> > them to the DSP.
> Ack.
> >
> >> -- 
> >> 2.34.1
> >>
> 

-- 
With best wishes
Dmitry
