On Wed, Feb 25, 2026 at 07:27:47PM +0530, Ekansh Gupta wrote:
> 
> 
> On 2/24/2026 2:47 AM, Dmitry Baryshkov wrote:
> > On Tue, Feb 24, 2026 at 12:38:55AM +0530, Ekansh Gupta wrote:
> >> Add initial documentation for the Qualcomm DSP Accelerator (QDA) driver
> >> integrated in the DRM accel subsystem.
> >>
> >> The new docs introduce QDA as a DRM/accel-based implementation of
> >> Hexagon DSP offload that is intended as a modern alternative to the
> >> legacy FastRPC driver in drivers/misc. The text describes the driver
> >> motivation, high-level architecture and interaction with IOMMU context
> >> banks, GEM-based buffer management and the RPMsg transport.
> >>
> >> The user-space facing section documents the main QDA IOCTLs used to
> >> establish DSP sessions, manage GEM buffer objects and invoke remote
> >> procedures using the FastRPC protocol, along with a typical lifecycle
> >> example for applications.
> >>
> >> Finally, the driver is wired into the Compute Accelerators
> >> documentation index under Documentation/accel, and a brief debugging
> >> section shows how to enable dynamic debug for the QDA implementation.
> >>
> >> Signed-off-by: Ekansh Gupta <[email protected]>
> >> ---
> >>  Documentation/accel/index.rst     |   1 +
> >>  Documentation/accel/qda/index.rst |  14 +++++
> >>  Documentation/accel/qda/qda.rst   | 129 ++++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 144 insertions(+)
> >>
> >> diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst
> >> index cbc7d4c3876a..5901ea7f784c 100644
> >> --- a/Documentation/accel/index.rst
> >> +++ b/Documentation/accel/index.rst
> >> @@ -10,4 +10,5 @@ Compute Accelerators
> >>     introduction
> >>     amdxdna/index
> >>     qaic/index
> >> +   qda/index
> >>     rocket/index
> >> diff --git a/Documentation/accel/qda/index.rst b/Documentation/accel/qda/index.rst
> >> new file mode 100644
> >> index 000000000000..bce188f21117
> >> --- /dev/null
> >> +++ b/Documentation/accel/qda/index.rst
> >> @@ -0,0 +1,14 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-only
> >> +
> >> +==============================
> >> + accel/qda Qualcomm DSP Driver
> >> +==============================
> >> +
> >> +The **accel/qda** driver provides support for Qualcomm Hexagon DSPs (Digital
> >> +Signal Processors) within the DRM accelerator framework. It serves as a modern
> >> +replacement for the legacy FastRPC driver, offering improved resource management
> >> +and standard subsystem integration.
> >> +
> >> +.. toctree::
> >> +
> >> +   qda
> >> diff --git a/Documentation/accel/qda/qda.rst b/Documentation/accel/qda/qda.rst
> >> new file mode 100644
> >> index 000000000000..742159841b95
> >> --- /dev/null
> >> +++ b/Documentation/accel/qda/qda.rst
> >> @@ -0,0 +1,129 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-only
> >> +
> >> +==================================
> >> +Qualcomm Hexagon DSP (QDA) Driver
> >> +==================================
> >> +
> >> +Introduction
> >> +============
> >> +
> >> +The **QDA** (Qualcomm DSP Accelerator) driver is a new DRM-based
> >> +accelerator driver for Qualcomm's Hexagon DSPs. It provides a standardized
> >> +interface for user-space applications to offload computational tasks ranging
> >> +from audio processing and sensor offload to computer vision and AI
> >> +inference to the Hexagon DSPs found on Qualcomm SoCs.
> >> +
> >> +This driver is designed to align with the Linux kernel's modern **Compute
> >> +Accelerators** subsystem (`drivers/accel/`), providing a robust and modular
> >> +alternative to the legacy FastRPC driver in `drivers/misc/`, offering
> >> +improved resource management and better integration with standard kernel
> >> +subsystems.
> >> +
> >> +Motivation
> >> +==========
> >> +
> >> +The existing FastRPC implementation in the kernel utilizes a custom character
> >> +device and lacks integration with modern kernel memory management frameworks.
> >> +The QDA driver addresses these limitations by:
> >> +
> >> +1. **Adopting the DRM accel Framework**: Leveraging standard uAPIs for device
> >> +   management, job submission, and synchronization.
> >> +2. **Utilizing GEM for Memory**: Providing proper buffer object management,
> >> +   including DMA-BUF import/export capabilities.
> >> +3. **Improving Isolation**: Using IOMMU context banks to enforce memory
> >> +   isolation between different DSP user sessions.
> >> +
> >> +Key Features
> >> +============
> >> +
> >> +* **Standard Accelerator Interface**: Exposes a standard character device
> >> +  node (e.g., `/dev/accel/accel0`) via the DRM subsystem.
> >> +* **Unified Offload Support**: Supports all DSP domains (ADSP, CDSP, SDSP,
> >> +  GDSP) via a single driver architecture.
> >> +* **FastRPC Protocol**: Implements the reliable Remote Procedure Call
> >> +  (FastRPC) protocol for communication between the application processor
> >> +  and DSP.
> >> +* **DMA-BUF Interop**: Seamless sharing of memory buffers between the DSP
> >> +  and other multimedia subsystems (GPU, Camera, Video) via standard DMA-BUFs.
> >> +* **Modular Design**: Clean separation between the core DRM logic, the memory
> >> +  manager, and the RPMsg-based transport layer.
> >> +
> >> +Architecture
> >> +============
> >> +
> >> +The QDA driver is composed of several modular components:
> >> +
> >> +1. **Core Driver (`qda_drv`)**: Manages device registration, file operations,
> >> +   and bridges the driver with the DRM accelerator subsystem.
> >> +2. **Memory Manager (`qda_memory_manager`)**: A flexible memory management
> >> +   layer that handles IOMMU context banks. It supports pluggable backends
> >> +   (such as DMA-coherent) to adapt to different SoC memory architectures.
> >> +3. **GEM Subsystem**: Implements the DRM GEM interface for buffer management:
> >> +
> >> +   * **`qda_gem`**: Core GEM object management, including allocation, mmap
> >> +     operations, and buffer lifecycle management.
> >> +   * **`qda_prime`**: PRIME import functionality for DMA-BUF interoperability,
> >> +     enabling seamless buffer sharing with other kernel subsystems.
> >> +
> >> +4. **Transport Layer (`qda_rpmsg`)**: Abstraction over the RPMsg framework
> >> +   to handle low-level message passing with the DSP firmware.
> >> +5. **Compute Bus (`qda_compute_bus`)**: A custom virtual bus used to
> >> +   enumerate and manage the specific compute context banks defined in the
> >> +   device tree.
> > 
> > I'm really not sure if it's a bonus or not. I'm waiting for iommu-map
> > improvements to land to send patches reworking FastRPC CB from using
> > probe into being created by the main driver: it would remove some of the
> > possible race conditions between the main driver finishing probe and the
> > CB devices probing in the background.
> > 
> > What's the actual benefit of the CB bus?
> 
> I tried following the Tegra host1x logic here, as was discussed here [1].
> My understanding is that with this the CB will become more manageable,
> reducing the scope of the races that exist in the current fastrpc driver.
It's nice, but then it can also be used by the existing fastrpc driver.
Would you mind splitting it into a separate changeset and submitting it?

> That said, I'm not completely aware of the iommu-map improvements. Is it
> the one being discussed for this patch [2]? If it helps the main driver
> create CB devices directly, then I would be happy to adapt the design.

That would mostly mean a change to the way we describe CBs (using the
property instead of the in-tree subdevices). Anyway, as I wrote, please
submit it separately.

> 
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
> 
> > 
> >> +6. **FastRPC Core (`qda_fastrpc`)**: Implements the protocol logic for
> >> +   marshalling arguments and handling remote invocations.
> >> +
> >> +User-Space API
> >> +==============
> >> +
> >> +The driver exposes a set of DRM-compliant IOCTLs. Note that these are designed
> >> +to be familiar to existing FastRPC users while adhering to DRM standards.
> >> +
> >> +* `DRM_IOCTL_QDA_QUERY`: Query DSP type (e.g., "cdsp", "adsp")
> >> +  and capabilities.
> >> +* `DRM_IOCTL_QDA_INIT_ATTACH`: Attach a user session to the DSP's protection
> >> +  domain.
> >> +* `DRM_IOCTL_QDA_INIT_CREATE`: Initialize a new process context on the DSP.
> > 
> > You need to explain the difference between these two.
> 
> Ack.
> 
> >> +* `DRM_IOCTL_QDA_INVOKE`: Submit a remote method invocation (the primary
> >> +  execution unit).
> >> +* `DRM_IOCTL_QDA_GEM_CREATE`: Allocate a GEM buffer object for DSP usage.
> >> +* `DRM_IOCTL_QDA_GEM_MMAP_OFFSET`: Retrieve mmap offsets for memory mapping.
> >> +* `DRM_IOCTL_QDA_MAP` / `DRM_IOCTL_QDA_MUNMAP`: Map or unmap buffers into the
> >> +  DSP's virtual address space.
> > 
> > Do we need to make this separate? Can we map/unmap buffers on their
> > usage? Or when they are created? I'm thinking here about the
> > virtualization case.
> The lib provides ways (fastrpc_mmap/remote_mmap64) for users to map/unmap
> buffers on the DSP as per the process's requirements. The ioctls are added
> to support the same. If the buffers are already mapped, the library calls
> become empty calls.

Let's focus on the API first and adapt to the library later on.

> > An alternative approach would be to merge
> > GET_MMAP_OFFSET with _MAP: once you map it to the DSP memory, you will
> > get the offset.
> 
> _MAP is not needed for all the buffers. Most of the remote call buffers
> that are passed to the DSP are automatically mapped by the DSP before
> invoking the DSP implementation, so user-space does not need to call _MAP
> for these.

Is there a reason for that? I'd really prefer if we change it, making it
more effective and more controllable.

> Some buffers (e.g., shared persistent buffers) do require explicit
> mapping, which is why MAP/MUNMAP exists in FastRPC.
> 
> Because of this behavioral difference, merging GET_MMAP_OFFSET with MAP
> is not accurate. GET_MMAP_OFFSET is for CPU-side mmap via GEM, whereas
> MAP is specifically for DSP virtual address assignment.
> 
> >> +
> >> +Usage Example
> >> +=============
> >> +
> >> +A typical lifecycle for a user-space application:
> >> +
> >> +1. **Discovery**: Open `/dev/accel/accel*` and check
> >> +   `DRM_IOCTL_QDA_QUERY` to find the desired DSP (e.g., CDSP for
> >> +   compute workloads).
> >> +2. **Initialization**: Call `DRM_IOCTL_QDA_INIT_ATTACH` and
> >> +   `DRM_IOCTL_QDA_INIT_CREATE` to establish a session.
> >> +3. **Memory**: Allocate buffers via `DRM_IOCTL_QDA_GEM_CREATE` or import
> >> +   DMA-BUFs (PRIME fd) from other drivers using `DRM_IOCTL_PRIME_FD_TO_HANDLE`.
> >> +4. **Execution**: Use `DRM_IOCTL_QDA_INVOKE` to pass arguments and execute
> >> +   functions on the DSP.
> >> +5. **Cleanup**: Close file descriptors to automatically release resources and
> >> +   detach the session.
> >> +
> >> +Internal Implementation
> >> +=======================
> >> +
> >> +Memory Management
> >> +-----------------
> >> +The driver's memory manager creates virtual "IOMMU devices" that map to
> >> +hardware context banks. This allows the driver to manage multiple isolated
> >> +address spaces. The implementation currently uses a **DMA-coherent backend**
> >> +to ensure data consistency between the CPU and DSP without manual cache
> >> +maintenance in most cases.
> >> +
> >> +Debugging
> >> +=========
> >> +The driver includes extensive dynamic debug support. Enable it via the
> >> +kernel's dynamic debug control:
> >> +
> >> +.. code-block:: bash
> >> +
> >> +   echo "file drivers/accel/qda/* +p" > /sys/kernel/debug/dynamic_debug/control
> > 
> > Please add documentation on how to build the test apps and how to load
> > them to the DSP.
> 
> Ack.
> 
> >> -- 
> >> 2.34.1
> >> 

-- 
With best wishes
Dmitry
