Hi,

This is RFC v4 of the NTB/PCI/dmaengine series that introduces an
optional NTB transport variant where payload data is moved by a PCI
embedded-DMA engine (eDMA) residing on the endpoint side.

The primary target is Synopsys DesignWare PCIe endpoint controllers that
integrate a DesignWare eDMA instance (dw-edma). In the remote
embedded-DMA mode, payload is transferred by DMA directly between the
two systems' memory, and NTB Memory Windows are used primarily for
control/metadata and for exposing the endpoint eDMA resources (register
window + linked-list rings) to the host.

Compared to the existing cpu/dma memcpy-based implementation, this
approach avoids window-backed payload rings and the associated extra
copies, and it is less sensitive to scarce MW space. This also enables
scaling out to multiple queue pairs, which is particularly beneficial
for ntb_netdev. On R-Car S4, preliminary iperf3 results show 10-20x
throughput improvement. Latency improvements are also observed.

RFC history:
  RFC v3: https://lore.kernel.org/all/[email protected]/
  RFC v2: https://lore.kernel.org/all/[email protected]/
  RFC v1: https://lore.kernel.org/all/[email protected]/

Parts of the RFC v3 series have already been split out and posted
separately (see the "Kernel base / dependencies" section below).
However, feedback on the remaining parts led to substantial
restructuring and code changes, so I am sending an RFC v4 as a refreshed
version of the full series.

RFC v4 is still a large, cross-subsystem series. At this RFC stage, I am
sending the full picture in a single set to make it easier to review the
overall direction and architecture. Once the direction is agreed upon
and no further large restructuring appears necessary, I will stop
posting new RFC-tagged revisions and continue development on separate
threads, split by sub-topic.

Many thanks for all the reviews and feedback from multiple perspectives.


Software architecture overview (RFC v4)
=======================================

A major change in RFC v4 is the software layering and module split.
The existing memcpy-based transport and the new remote embedded-DMA
transport are implemented as two independent NTB client drivers on top
of a shared core library:

                       +--------------------+
                       | ntb_transport_core |
                       +--------------------+
                           ^            ^
                           |            |
        ntb_transport -----+            +----- ntb_transport_edma
       (cpu/dma memcpy)                   (remote embedded DMA transfer)
                                                       |
                                                       v
                                                 +-----------+
                                                 | ntb_edma  |
                                                 +-----------+
                                                       ^
                                                       |
                                               +----------------+
                                               |                |
                                          ntb_dw_edma         [...]

Key points:

  * ntb_transport_core provides the queue-pair abstraction used by upper
    layer clients (e.g. ntb_netdev).
  * ntb_transport is the legacy shared-memory transport client (CPU/DMA
    memcpy).
  * ntb_transport_edma is the remote embedded-DMA transport client.
  * ntb_transport_edma relies on an ntb_edma backend registry.
    This RFC provides an initial DesignWare backend (ntb_dw_edma).
  * Transport selection is per-NTB device via the standard
    driver_override mechanism. To enable that, this RFC adds
    driver_override support to ntb_bus. This allows mixing transports
    across multiple NTB ports and provides an explicit fallback path to
    the legacy transport.

So, if ntb_transport / ntb_transport_edma are built as loadable modules,
you can just run modprobe ntb_transport as before and the original
cpu/dma memcpy-based implementation will be active. If they are
built-in, which of ntb_transport and ntb_transport_edma is bound by
default depends on initcall order. For how to switch the driver, please
see Patch 34 ("Documentation: driver-api: ntb: Document remote
embedded-DMA transport").


Data flow overview (remote embedded-DMA transport)
==================================================

At a high level:

  * One MW is reserved as an "eDMA window". The endpoint exposes the
    eDMA register block plus LL descriptor rings through that window, so
    the peer can ioremap it and drive DMA reads remotely.
  * Remaining MWs carry only small control-plane rings used to exchange
    buffer addresses and completion information.
  * For RC->EP traffic, the RC drives endpoint DMA read channels through
    the peer-visible eDMA window.
  * For EP->RC traffic, the endpoint uses its local DMA write channels.

The following figures illustrate the data flow when ntb_netdev sits on
top of the transport:

Figure 1. RC->EP traffic via ntb_netdev + ntb_transport_edma
          backed by ntb_edma/ntb_dw_edma

  [Diagram: EP and RC physical address spaces. The EP-side EDMA REG and
   EDMA LL regions are exposed to the RC through window [A]; the
   control-plane rings [B] and [C] and the steps [0-a], [0-b] and
   [1]-[5] are annotated as listed below.]

Figure 2. EP->RC traffic via ntb_netdev + ntb_transport_edma
          backed by ntb_edma/ntb_dw_edma

  [Diagram: EP and RC physical address spaces with the EP-local EDMA REG
   and EDMA LL regions; the control-plane rings [B] and [C] and the
   steps [0-b] and [1]-[5] are annotated as listed below.]

  0-a. configure remote embedded DMA (program endpoint DMA registers)
  0-b. DMA-map and publish destination address (DAR)
  1.   network stack builds skb (copy from application/user memory)
  2.   consume DAR, DMA-map source address (SAR) and kick DMA transfer
  3.   DMA transfer (payload moves between RC/EP memory)
  4.   consume completion (commit)
  5.   network stack delivers data to application/user memory

  [A]: Dedicated MW that aggregates DMA regs and LL (peer ioremaps it)
  [B]: Control-plane ring buffer for "produce"
  [C]: Control-plane ring buffer for "consume"


Kernel base / dependencies
==========================

This series is based on:

  - next-20260114 (commit b775e489bec7)

plus the following seven unmerged patch series or standalone patches:

  - [PATCH v4 0/7] PCI: endpoint/NTB: Harden vNTB resource management
    https://lore.kernel.org/all/[email protected]/

  - [PATCH v2 0/2] NTB: ntb_transport: debugfs cleanups
    https://lore.kernel.org/all/[email protected]/

  - [PATCH v3 0/9] dmaengine: Add new API to combine configuration and
    descriptor preparation
    https://lore.kernel.org/all/[email protected]/

  - [PATCH v8 0/5] PCI: endpoint: BAR subrange mapping support
    https://lore.kernel.org/all/[email protected]/

  - [PATCH] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on
    mws_size[] access
    https://lore.kernel.org/all/[email protected]/

  - [PATCH] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr
    interrupts
    https://lore.kernel.org/all/[email protected]/

  - [PATCH v2 01/11] dmaengine: dw-edma: Add spinlock to protect
    DONE_INT_MASK and ABORT_INT_MASK
    https://lore.kernel.org/imx/[email protected]/
    (only this single commit is cherry-picked from the series)


Patch layout
============

  1. dw-edma / DesignWare EP helpers needed for remote embedded-DMA
     (export register/LL windows, IRQ routing control, etc.)

     Patch 01 : dmaengine: dw-edma: Export helper to get integrated register window
     Patch 02 : dmaengine: dw-edma: Add per-channel interrupt routing control
     Patch 03 : dmaengine: dw-edma: Poll completion when local IRQ handling is disabled
     Patch 04 : dmaengine: dw-edma: Add notify-only channels support
     Patch 05 : dmaengine: dw-edma: Add a helper to query linked-list region

  2. NTB EPF/core + vNTB prep (mwN_offset + versioning, MSI vector
     management, new ntb_dev_ops helpers, driver_override, vntb glue)

     Patch 06 : NTB: epf: Add mwN_offset support and config region versioning
     Patch 07 : NTB: epf: Reserve a subset of MSI vectors for non-NTB users
     Patch 08 : NTB: epf: Provide db_vector_count/db_vector_mask callbacks
     Patch 09 : NTB: core: Add mw_set_trans_ranges() for subrange programming
     Patch 10 : NTB: core: Add .get_private_data() to ntb_dev_ops
     Patch 11 : NTB: core: Add .get_dma_dev() to ntb_dev_ops
     Patch 12 : NTB: core: Add driver_override support for NTB devices
     Patch 13 : PCI: endpoint: pci-epf-vntb: Support BAR subrange mappings for MWs
     Patch 14 : PCI: endpoint: pci-epf-vntb: Implement .get_private_data() callback
     Patch 15 : PCI: endpoint: pci-epf-vntb: Implement .get_dma_dev()

  3. ntb_transport refactor/modularization and backend infrastructure

     Patch 16 : NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
     Patch 17 : NTB: ntb_transport: Dynamically determine qp count
     Patch 18 : NTB: ntb_transport: Use ntb_get_dma_dev()
     Patch 19 : NTB: ntb_transport: Rename ntb_transport.c to ntb_transport_core.c
     Patch 20 : NTB: ntb_transport: Move internal types to ntb_transport_internal.h
     Patch 21 : NTB: ntb_transport: Export common helpers for modularization
     Patch 22 : NTB: ntb_transport: Split core library and default NTB client
     Patch 23 : NTB: ntb_transport: Add transport backend infrastructure
     Patch 24 : NTB: ntb_transport: Run ntb_set_mw() before link-up negotiation

  4. ntb_edma backend registry + DesignWare backend + transport client

     Patch 25 : NTB: hw: Add remote eDMA backend registry and DesignWare backend
     Patch 26 : NTB: ntb_transport: Add remote embedded-DMA transport client

  5. ntb_netdev multi-queue support

     Patch 27 : ntb_netdev: Multi-queue support

  6. Renesas R-Car S4 enablement (IOMMU, DTs, quirks)

     Patch 28 : iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
     Patch 29 : iommu: ipmmu-vmsa: Add support for reserved regions
     Patch 30 : arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe eDMA
     Patch 31 : NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
     Patch 32 : NTB: epf: Add an additional memory window (MW2) barno mapping on Renesas R-Car

  7. Documentation updates

     Patch 33 : Documentation: PCI: endpoint: pci-epf-vntb: Update and add mwN_offset usage
     Patch 34 : Documentation: driver-api: ntb: Document remote embedded-DMA transport

  8. pci-epf-test / pci_endpoint_test / kselftest coverage for remote eDMA

     Patch 35 : PCI: endpoint: pci-epf-test: Add pci_epf_test_next_free_bar() helper
     Patch 36 : PCI: endpoint: pci-epf-test: Add remote eDMA-backed mode
     Patch 37 : misc: pci_endpoint_test: Add remote eDMA transfer test mode
     Patch 38 : selftests: pci_endpoint: Add remote eDMA transfer coverage


Tested on
=========

  * 2x Renesas R-Car S4 Spider (RC<->EP connected with OCuLink cable)
  * Kernel base as described above


Performance notes
=================

The primary motivation remains improving throughput/latency for
ntb_transport users (typically ntb_netdev). On R-Car S4, the earlier
prototype (RFC v3) showed roughly 10-20x throughput improvement in
preliminary iperf3 tests and lower ping RTT. I have not yet re-measured
after the v4 refactor and module split.


Changelog
=========

RFCv3->RFCv4 changes:
  - Major refactor of the transport layering:
    - Introduce ntb_transport_core as a shared library module.
    - Split the legacy shared-memory transport client (ntb_transport)
      and the remote embedded-DMA transport client (ntb_transport_edma).
    - Add driver_override support for ntb_bus and use it for per-port
      transport selection.
  - Introduce a vendor-agnostic remote embedded-DMA backend registry
    (ntb_edma) and add the initial DesignWare backend (ntb_dw_edma).
  - Rebase to next-20260114 and move several prerequisite/fixup
    patchsets into separate threads (listed above), including BAR
    subrange mapping support and dw-edma fixes.
  - Add PCI endpoint test coverage for the remote embedded-DMA path:
    - extend pci-epf-test / pci_endpoint_test
    - add a kselftest variant to exercise remote-eDMA transfers
    Note: to keep the changes as small as possible, I added a few
    #ifdefs in the main test code. Feedback on whether, how, and to
    what extent this should be split into separate modules would be
    appreciated.
  - Expand documentation (Documentation/driver-api/ntb.rst) to describe
    transport variants, the new module structure, and the remote
    embedded-DMA data flow.
  - Addressed other feedback from the RFC v3 thread.

RFCv2->RFCv3 changes:
  - Architecture
    - Have the EP side use its local write channels, while leaving the
      RC side to use remote read channels.
    - Improved encapsulation of the abstraction and HW-specific parts.
  - Added control/config region versioning for the vNTB/EPF control
    region so that mismatched RC/EP kernels fail early instead of
    silently using an incompatible layout.
  - Reworked BAR subrange / multi-region mapping support:
    - Dropped the v2 approach that added new inbound mapping ops in the
      EPC core.
    - Introduced `struct pci_epf_bar.submap` and extended DesignWare EP
      to support BAR subrange inbound mapping via Address Match Mode IB
      iATU.
    - pci-epf-vntb now provides a subrange mapping hint to the EPC
      driver when offsets are used.
  - Changed .get_pci_epc() to .get_private_data()
  - Dropped two commits from RFC v2 that should be submitted separately:
    (1) ntb_transport debugfs seq_file conversion
    (2) DWC EP outbound iATU MSI mapping/cache fix (will be re-posted
        separately)
  - Added documentation updates.
  - Addressed assorted review nits from the RFC v2 thread
    (naming/structure).

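As an illustration of the fail-early idea behind the control/config
region versioning noted in the RFCv2->RFCv3 changes, here is a minimal
userspace sketch. Everything in it is hypothetical: the struct, the
field names, the magic/version values and the helper name are
assumptions made for this example, not the layout or code used by
pci-epf-vntb or ntb_hw_epf in this series.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical header at the start of a shared control region. */
struct ctrl_hdr {
	uint32_t magic;   /* identifies the region type (assumed field) */
	uint32_t version; /* layout version published by the peer (assumed field) */
};

#define CTRL_MAGIC   0x4e544200u /* hypothetical value */
#define CTRL_VERSION 2u          /* hypothetical layout version this side understands */

/*
 * Return 0 when the peer-published header matches what this side
 * understands. Otherwise return a negative errno so the caller can
 * refuse to continue instead of parsing an unknown layout.
 */
static int ctrl_hdr_check(const struct ctrl_hdr *hdr)
{
	if (hdr->magic != CTRL_MAGIC)
		return -ENODEV;
	if (hdr->version != CTRL_VERSION)
		return -EPROTO;
	return 0;
}
```

A probe path following this pattern would run the check once, before
reading any other field from the region, so that a version mismatch
surfaces as an error at setup time.
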
RFCv1->RFCv2 changes:
  - Architecture
    - Drop the generic interrupt backend + DW eDMA test-interrupt
      backend approach and instead adopt the remote eDMA-backed
      ntb_transport mode proposed by Frank Li. The BAR-sharing /
      mwN_offset / inbound mapping (Address Match Mode) infrastructure
      from RFC v1 is largely kept, with only minor refinements and code
      motion where necessary to fit the new transport-mode design.
  - For Patch 01
    - Rework the array_index_nospec() conversion to address review
      comments on "[RFC PATCH 01/25]".

Thank you for reviewing,

Koichiro Den (38):
  dmaengine: dw-edma: Export helper to get integrated register window
  dmaengine: dw-edma: Add per-channel interrupt routing control
  dmaengine: dw-edma: Poll completion when local IRQ handling is disabled
  dmaengine: dw-edma: Add notify-only channels support
  dmaengine: dw-edma: Add a helper to query linked-list region
  NTB: epf: Add mwN_offset support and config region versioning
  NTB: epf: Reserve a subset of MSI vectors for non-NTB users
  NTB: epf: Provide db_vector_count/db_vector_mask callbacks
  NTB: core: Add mw_set_trans_ranges() for subrange programming
  NTB: core: Add .get_private_data() to ntb_dev_ops
  NTB: core: Add .get_dma_dev() to ntb_dev_ops
  NTB: core: Add driver_override support for NTB devices
  PCI: endpoint: pci-epf-vntb: Support BAR subrange mappings for MWs
  PCI: endpoint: pci-epf-vntb: Implement .get_private_data() callback
  PCI: endpoint: pci-epf-vntb: Implement .get_dma_dev()
  NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
  NTB: ntb_transport: Dynamically determine qp count
  NTB: ntb_transport: Use ntb_get_dma_dev()
  NTB: ntb_transport: Rename ntb_transport.c to ntb_transport_core.c
  NTB: ntb_transport: Move internal types to ntb_transport_internal.h
  NTB: ntb_transport: Export common helpers for modularization
  NTB: ntb_transport: Split core library and default NTB client
  NTB: ntb_transport: Add transport backend infrastructure
  NTB: ntb_transport: Run ntb_set_mw() before link-up negotiation
  NTB: hw: Add remote eDMA backend registry and DesignWare backend
  NTB: ntb_transport: Add remote embedded-DMA transport client
  ntb_netdev: Multi-queue support
  iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
  iommu: ipmmu-vmsa: Add support for reserved regions
  arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe eDMA
  NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
  NTB: epf: Add an additional memory window (MW2) barno mapping on Renesas R-Car
  Documentation: PCI: endpoint: pci-epf-vntb: Update and add mwN_offset usage
  Documentation: driver-api: ntb: Document remote embedded-DMA transport
  PCI: endpoint: pci-epf-test: Add pci_epf_test_next_free_bar() helper
  PCI: endpoint: pci-epf-test: Add remote eDMA-backed mode
  misc: pci_endpoint_test: Add remote eDMA transfer test mode
  selftests: pci_endpoint: Add remote eDMA transfer coverage

 Documentation/PCI/endpoint/pci-vntb-howto.rst |   19 +-
 Documentation/driver-api/ntb.rst              |  193 ++
 arch/arm64/boot/dts/renesas/Makefile          |    2 +
 .../boot/dts/renesas/r8a779f0-spider-ep.dts   |   37 +
 .../boot/dts/renesas/r8a779f0-spider-rc.dts   |   52 +
 drivers/dma/dw-edma/dw-edma-core.c            |  207 +-
 drivers/dma/dw-edma/dw-edma-core.h            |   10 +
 drivers/dma/dw-edma/dw-edma-v0-core.c         |   26 +-
 drivers/iommu/ipmmu-vmsa.c                    |    7 +-
 drivers/misc/pci_endpoint_test.c              |  633 +++++
 drivers/net/ntb_netdev.c                      |  341 ++-
 drivers/ntb/Kconfig                           |   13 +
 drivers/ntb/Makefile                          |    2 +
 drivers/ntb/core.c                            |   68 +
 drivers/ntb/hw/Kconfig                        |    1 +
 drivers/ntb/hw/Makefile                       |    1 +
 drivers/ntb/hw/edma/Kconfig                   |   28 +
 drivers/ntb/hw/edma/Makefile                  |    5 +
 drivers/ntb/hw/edma/backend.c                 |   87 +
 drivers/ntb/hw/edma/backend.h                 |  102 +
 drivers/ntb/hw/edma/ntb_dw_edma.c             |  977 +++++++
 drivers/ntb/hw/epf/ntb_hw_epf.c               |  199 +-
 drivers/ntb/ntb_transport.c                   | 2458 +---------------
 drivers/ntb/ntb_transport_core.c              | 2523 +++++++++++++++++
 drivers/ntb/ntb_transport_edma.c              | 1110 ++++++++
 drivers/ntb/ntb_transport_internal.h          |  261 ++
 drivers/pci/controller/dwc/pcie-designware.c  |   26 +
 drivers/pci/endpoint/functions/pci-epf-test.c |  497 +++-
 drivers/pci/endpoint/functions/pci-epf-vntb.c |  380 ++-
 include/linux/dma/edma.h                      |  106 +
 include/linux/ntb.h                           |   88 +
 include/uapi/linux/pcitest.h                  |    3 +-
 .../pci_endpoint/pci_endpoint_test.c          |   17 +
 33 files changed, 7855 insertions(+), 2624 deletions(-)
 create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
 create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
 create mode 100644 drivers/ntb/hw/edma/Kconfig
 create mode 100644 drivers/ntb/hw/edma/Makefile
 create mode 100644 drivers/ntb/hw/edma/backend.c
 create mode 100644 drivers/ntb/hw/edma/backend.h
 create mode 100644 drivers/ntb/hw/edma/ntb_dw_edma.c
 create mode 100644 drivers/ntb/ntb_transport_core.c
 create mode 100644 drivers/ntb/ntb_transport_edma.c
 create mode 100644 drivers/ntb/ntb_transport_internal.h

-- 
2.51.0
