Hello Midgy,

On 6/4/2026 9:52 PM, Midgy BALON wrote:
> RFC, not for merge. End-to-end inference does not produce correct output
> yet (see Status), so per the v2 discussion this is a request for design
> feedback. It now probes, attaches, and submits cleanly on a stock
> v7.1-rc6 tree; what remains is one hardware-internal issue.
> 
> The RK3568 has a single NVDLA-derived NPU core, the same IP family as the
> RK3588 NPU the driver already supports; the register layout matches. The
> RK3568 differences are a 32-bit NPU AXI/IOMMU (vs 40-bit) and explicit
> PVTPLL/PMU bring-up to power and de-idle the NPU before it is reachable.
> 
> Patches:
>   1-2  rocket: per-SoC data struct, then derive DMA width and core count
>        from match data (refactors, no functional change).
>   3    rocket: RK3568 SoC data + PVTPLL/PMU/NOC bring-up.
>   4    rocket: reset the NPU before detaching the IOMMU on a job timeout
>        (the detach otherwise stalls a wedged AXI master and WARNs).
>   5    rocket: keep the IOMMU domain attached across jobs instead of
>        re-attaching per job (the per-job rk_iommu handshake on the idle
>        NPU MMU is slow and noisy).
>   6    iommu/rockchip: clear AUTO_GATING bit 1 on the RK356x v1 IOMMU so
>        the page-walker keeps its clock (else a TLB-miss walk never
>        completes).
>   7    dt-bindings: add the RK3568 NPU compatible.
>   8-9  arm64 dts: add the NPU and its IOMMU, and enable them on ROCK 3B.
> 
> Dependency. The NPU MMU is rockchip-iommu v1 (32-bit) while the rest of
> the RK3568 uses v2 (40-bit). They cannot coexist until the driver carries
> per-device ops; this series is developed on top of Simon Xue's
> "iommu/rockchip: Drop global rk_ops in favor of per-device ops" [1].
> Without it the NPU IOMMU fails to probe on a full RK3568 boot.
>

If I understand correctly, the NPU IOMMU should be v2 rather than v1,
which implies it supports 40-bit physical addresses. Note, however, that
the DTE is still limited to 32 bits.

> Power bring-up. The NPU is brought up through the power-domain layer (no
> driver hack): the NPU power-domain keeps its clocks but drops the pm_qos
> phandle (qos_npu sits behind the gated NPU NoC, so genpd's power-off QoS
> save faults reading it), and vdd_npu is marked always-on so the rail is
> up before genpd de-idles the NoC at power-on. The PMU de-idle then ACKs
> without PVTPLL running; PVTPLL is only needed for compute.
>

Can these operations not be handled by the pmdomain driver?
If some of them are controlled by TF-A, are you using the
open-source TF-A? Thank you.

> Status. On v7.1-rc6 the driver probes, creates /dev/accel/accel0,
> attaches an IOMMU domain, and submits jobs; the program controller
> fetches and broadcasts the command list. Inference output is still wrong,
> and the cause is split across three layers:
>   - kernel (this series): the RK3568 differences appear handled;
>   - mesa/Teflon userspace: still emits RK3588-tuned config, wrong for
>     RK3568 (to be filed separately on mesa-dev);
>   - hardware: with corrected config the NPU's DMA reads the full input
>     and weight tensors (confirmed via its DMA bandwidth counters), but
>     the MAC/output stage never completes, the job times out, and the
>     output stays at the buffer's zero-point. I have not found the missing
>     step; it is not in the command list (replaying the vendor's
>     byte-exact command list behaves the same). Pointers welcome,
>     especially from anyone with RK3568 NPU experience.
> 
> Known residual. On the first IOMMU attach the NPU MMU is idle with paging
> already enabled; the rk_iommu stall/reset handshake does not complete in
> that state and logs one burst of timeouts before the (kept) domain
> settles. It is harmless here because the job times out regardless, but it
> points at an idle-MMU reconfiguration corner the rk_iommu code does not
> handle on this block.
> 
> [1] https://lore.kernel.org/linux-rockchip/[email protected]/
> 
> Changes since v2:
>   - Tagged RFC; now tested on a stock v7.1-rc6 tree.
>   - Bring-up moved into the power-domain/DT layer (no initcall hack).
>   - Added the IOMMU detach-on-timeout and attach-once driver fixes.
>   - Split the driver patch (Heiko): soc_data / match-data / RK3568.
>   - Derive DMA width and core count from match data; drop the DT rescans.
>   - Binding describes the hardware; added the missing $ref on rockchip,pmu.
>   - Disclosed the per-device-ops IOMMU dependency.
> 
> Midgy BALON (9):
>   accel: rocket: Introduce per-SoC rocket_soc_data
>   accel: rocket: Derive DMA width and core count from match data
>   accel: rocket: Add RK3568 SoC support
>   accel: rocket: Reset the NPU before detaching the IOMMU on timeout
>   accel: rocket: Keep the IOMMU domain attached across jobs
>   iommu/rockchip: Clear AUTO_GATING bit 1 on the RK356x v1 IOMMU
>   dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568
>   arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU
>   arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU
> 
>  .../npu/rockchip,rk3588-rknn-core.yaml        | 18 ++++-
>  .../boot/dts/rockchip/rk3568-rock-3b.dts      | 14 +++-
>  arch/arm64/boot/dts/rockchip/rk356x-base.dtsi | 38 +++++++++++
>  drivers/accel/rocket/rocket_core.c            | 22 ++++++-
>  drivers/accel/rocket/rocket_core.h            | 19 ++++++
>  drivers/accel/rocket/rocket_device.c          | 15 ++---
>  drivers/accel/rocket/rocket_device.h          |  3 +-
>  drivers/accel/rocket/rocket_drv.c             | 66 ++++++++++++++++++-
>  drivers/accel/rocket/rocket_job.c             | 35 ++++++++--
>  drivers/iommu/rockchip-iommu.c                | 12 ++++
>  10 files changed, 219 insertions(+), 23 deletions(-)
> 
> 
> base-commit: 52c800fdcf11888ebeb50c3d707f782cc15b66eb

-- 
Best, 
Chaoyi
