Hello Midgy,

On 6/4/2026 9:52 PM, Midgy BALON wrote:
> RFC, not for merge. End-to-end inference does not produce correct output
> yet (see Status), so per the v2 discussion this is a request for design
> feedback. It now probes, attaches, and submits cleanly on a stock
> v7.1-rc6 tree; what remains is one hardware-internal issue.
>
> The RK3568 has a single NVDLA-derived NPU core, the same IP family as the
> RK3588 NPU the driver already supports; the register layout matches. The
> RK3568 differences are a 32-bit NPU AXI/IOMMU (vs 40-bit) and explicit
> PVTPLL/PMU bring-up to power and de-idle the NPU before it is reachable.
>
> Patches:
>   1-2  rocket: per-SoC data struct, then derive DMA width and core count
>        from match data (refactors, no functional change).
>   3    rocket: RK3568 SoC data + PVTPLL/PMU/NOC bring-up.
>   4    rocket: reset the NPU before detaching the IOMMU on a job timeout
>        (the detach otherwise stalls a wedged AXI master and WARNs).
>   5    rocket: keep the IOMMU domain attached across jobs instead of
>        re-attaching per job (the per-job rk_iommu handshake on the idle
>        NPU MMU is slow and noisy).
>   6    iommu/rockchip: clear AUTO_GATING bit 1 on the RK356x v1 IOMMU so
>        the page-walker keeps its clock (else a TLB-miss walk never
>        completes).
>   7    dt-bindings: add the RK3568 NPU compatible.
>   8-9  arm64 dts: add the NPU and its IOMMU, and enable them on ROCK 3B.
>
> Dependency. The NPU MMU is rockchip-iommu v1 (32-bit) while the rest of
> the RK3568 uses v2 (40-bit). They cannot coexist until the driver carries
> per-device ops; this series is developed on top of Simon Xue's
> "iommu/rockchip: Drop global rk_ops in favor of per-device ops" [1].
> Without it the NPU IOMMU fails to probe on a full RK3568 boot.
>
Hmmm. If I understand correctly, the NPU IOMMU should be v2 rather than v1,
implying it should support 40-bit PAs. Nevertheless, please note that the
upper limit for DTE is 32 bits.

> Power bring-up. The NPU is brought up through the power-domain layer (no
> driver hack): the NPU power-domain keeps its clocks but drops the pm_qos
> phandle (qos_npu sits behind the gated NPU NoC, so genpd's power-off QoS
> save faults reading it), and vdd_npu is marked always-on so the rail is
> up before genpd de-idles the NoC at power-on. The PMU de-idle then ACKs
> without PVTPLL running; PVTPLL is only needed for compute.
>

Can these operations not be completed via the pmdomain driver?
If some operations are controlled by TF-A, are you using open source TF-A?

Thank you.

> Status. On v7.1-rc6 the driver probes, creates /dev/accel/accel0,
> attaches an IOMMU domain, and submits jobs; the program controller
> fetches and broadcasts the command list. Inference output is still wrong,
> and the cause is split across three layers:
>   - kernel (this series): the RK3568 differences appear handled;
>   - mesa/Teflon userspace: still emits RK3588-tuned config, wrong for
>     RK3568 (to be filed separately on mesa-dev);
>   - hardware: with corrected config the NPU's DMA reads the full input
>     and weight tensors (confirmed via its DMA bandwidth counters), but
>     the MAC/output stage never completes, the job times out, and the
>     output stays at the buffer's zero-point. I have not found the missing
>     step; it is not in the command list (replaying the vendor's
>     byte-exact command list behaves the same). Pointers welcome,
>     especially from anyone with RK3568 NPU experience.
>
> Known residual. On the first IOMMU attach the NPU MMU is idle with paging
> already enabled; the rk_iommu stall/reset handshake does not complete in
> that state and logs one burst of timeouts before the (kept) domain
> settles. It is harmless here because the job times out regardless, but it
> points at an idle-MMU reconfiguration corner the rk_iommu code does not
> handle on this block.
>
> [1]
> https://lore.kernel.org/linux-rockchip/[email protected]/
>
> Changes since v2:
>   - Tagged RFC; now tested on a stock v7.1-rc6 tree.
>   - Bring-up moved into the power-domain/DT layer (no initcall hack).
>   - Added the IOMMU detach-on-timeout and attach-once driver fixes.
>   - Split the driver patch (Heiko): soc_data / match-data / RK3568.
>   - Derive DMA width and core count from match data; drop the DT rescans.
>   - Binding describes the hardware; added the missing $ref on rockchip,pmu.
>   - Disclosed the per-device-ops IOMMU dependency.
>
> Midgy BALON (9):
>   accel: rocket: Introduce per-SoC rocket_soc_data
>   accel: rocket: Derive DMA width and core count from match data
>   accel: rocket: Add RK3568 SoC support
>   accel: rocket: Reset the NPU before detaching the IOMMU on timeout
>   accel: rocket: Keep the IOMMU domain attached across jobs
>   iommu/rockchip: Clear AUTO_GATING bit 1 on the RK356x v1 IOMMU
>   dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568
>   arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU
>   arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU
>
>  .../npu/rockchip,rk3588-rknn-core.yaml        | 18 ++++-
>  .../boot/dts/rockchip/rk3568-rock-3b.dts      | 14 +++-
>  arch/arm64/boot/dts/rockchip/rk356x-base.dtsi | 38 +++++++++++
>  drivers/accel/rocket/rocket_core.c            | 22 ++++++-
>  drivers/accel/rocket/rocket_core.h            | 19 ++++++
>  drivers/accel/rocket/rocket_device.c          | 15 ++---
>  drivers/accel/rocket/rocket_device.h          |  3 +-
>  drivers/accel/rocket/rocket_drv.c             | 66 ++++++++++++++++++-
>  drivers/accel/rocket/rocket_job.c             | 35 ++++++++--
>  drivers/iommu/rockchip-iommu.c                | 12 ++++
>  10 files changed, 219 insertions(+), 23 deletions(-)
>
>
> base-commit: 52c800fdcf11888ebeb50c3d707f782cc15b66eb

--
Best,
Chaoyi
