On 1/18/26 6:54 AM, Koichiro Den wrote:
> Hi,
>
> This is RFC v4 of the NTB/PCI/dmaengine series that introduces an
> optional NTB transport variant where payload data is moved by a PCI
> embedded-DMA engine (eDMA) residing on the endpoint side.

Just a fly-by comment: this series is huge. I suggest breaking it down into
something more manageable to prevent review fatigue. For example, the Linux
networking subsystem restricts patch series to no more than 15 patches. The
NTB subsystem has no such rule, but maybe split out the dmaengine changes
and the hardware-specific dw-edma bits from the NTB core changes.

DJ

>
> The primary target is Synopsys DesignWare PCIe endpoint controllers that
> integrate a DesignWare eDMA instance (dw-edma). In the remote
> embedded-DMA mode, payload is transferred by DMA directly between the
> two systems' memory, and NTB Memory Windows are used primarily for
> control/metadata and for exposing the endpoint eDMA resources (register
> window + linked-list rings) to the host.
>
> Compared to the existing cpu/dma memcpy-based implementation, this
> approach avoids window-backed payload rings and the associated extra
> copies, and it is less sensitive to scarce MW space. This also enables
> scaling out to multiple queue pairs, which is particularly beneficial
> for ntb_netdev. On R-Car S4, preliminary iperf3 results show a 10-20x
> throughput improvement. Latency improvements are also observed.
>
> RFC history:
> RFC v3:
> https://lore.kernel.org/all/[email protected]/
> RFC v2:
> https://lore.kernel.org/all/[email protected]/
> RFC v1:
> https://lore.kernel.org/all/[email protected]/
>
> Parts of the RFC v3 series have already been split out and posted separately
> (see "Kernel base / dependencies" section below). However, feedback on
> the remaining parts led to substantial restructuring and code changes,
> so I am sending an RFC v4 as a refreshed version of the full series.
>
> RFC v4 is still a large, cross-subsystem series. At this RFC stage,
> I am sending the full picture in a single set to make it easier to
> review the overall direction and architecture. Once the direction is
> agreed upon and no further large restructuring appears necessary, I
> will stop posting new RFC-tagged revisions and continue development
> on separate threads, split by sub-topic.
>
> Many thanks for all the reviews and feedback from multiple perspectives.
>
>
> Software architecture overview (RFC v4)
> =======================================
>
> A major change in RFC v4 is the software layering and module split.
>
> The existing memcpy-based transport and the new remote embedded-DMA
> transport are implemented as two independent NTB client drivers on top
> of a shared core library:
>
>                       +--------------------+
>                       | ntb_transport_core |
>                       +--------------------+
>                           ^            ^
>                           |            |
>        ntb_transport -----+            +----- ntb_transport_edma
>       (cpu/dma memcpy)                  (remote embedded DMA transfer)
>                                                       |
>                                                       v
>                                                 +-----------+
>                                                 | ntb_edma  |
>                                                 +-----------+
>                                                       ^
>                                                       |
>                                               +----------------+
>                                               |                |
>                                          ntb_dw_edma         [...]
>
> Key points:
> * ntb_transport_core provides the queue-pair abstraction used by upper
> layer clients (e.g. ntb_netdev).
> * ntb_transport is the legacy shared-memory transport client (CPU/DMA
> memcpy).
> * ntb_transport_edma is the remote embedded-DMA transport client.
> * ntb_transport_edma relies on an ntb_edma backend registry.
> This RFC provides an initial DesignWare backend (ntb_dw_edma).
> * Transport selection is per-NTB device via the standard
> driver_override mechanism. To enable that, this RFC adds
> driver_override support to ntb_bus. This allows mixing transports
> across multiple NTB ports and provides an explicit fallback path to
> the legacy transport.
>
> So, if ntb_transport / ntb_transport_edma are built as loadable modules,
> you can just run modprobe ntb_transport as before and the original cpu/dma
> memcpy-based implementation will be active. If they are built-in, whether
> ntb_transport or ntb_transport_edma is bound by default depends on
> initcall order. Regarding how to switch the driver, please see Patch 34
> ("Documentation: driver-api: ntb: Document remote embedded-DMA transport")
> for details.
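
For readers unfamiliar with driver_override, the usual bus-level matching
rule can be sketched in plain C as below. This is a user-space illustration
only, not the code from Patch 12: the struct and function names
(sketch_device, sketch_driver, sketch_bus_match) are hypothetical, and the
real ntb_bus match logic may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel's device/driver structs. */
struct sketch_driver {
	const char *name;
};

struct sketch_device {
	const char *driver_override; /* NULL when no override is set */
};

/*
 * Return true if drv may bind to dev.
 * With an override set, only the driver whose name matches may bind.
 * Without one, any client driver on the bus is a candidate, so the
 * first one to register wins.
 */
static bool sketch_bus_match(const struct sketch_device *dev,
			     const struct sketch_driver *drv)
{
	if (dev->driver_override)
		return strcmp(dev->driver_override, drv->name) == 0;
	return true;
}
```

With no override, both ntb_transport and ntb_transport_edma would match,
which is consistent with the built-in default depending on initcall order.
Setting the override to one name excludes the other.
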
>
>
> Data flow overview (remote embedded-DMA transport)
> ==================================================
>
> At a high level:
> * One MW is reserved as an "eDMA window". The endpoint exposes the
> eDMA register block plus LL descriptor rings through that window, so
> the peer can ioremap it and drive DMA reads remotely.
> * Remaining MWs carry only small control-plane rings used to exchange
> buffer addresses and completion information.
> * For RC->EP traffic, the RC drives endpoint DMA read channels through
> the peer-visible eDMA window.
> * For EP->RC traffic, the endpoint uses its local DMA write channels.
>
> The following figures illustrate the data flow when ntb_netdev sits on
> top of the transport:
>
> Figure 1. RC->EP traffic via ntb_netdev + ntb_transport_edma
> backed by ntb_edma/ntb_dw_edma
>
>              EP                                   RC
>           phys addr                            phys addr
>             space                                space
>              +-+                                  +-+
>              | |                                  | |
>              | |                ||                | |
>              +-+-----.          ||                | |
>     EDMA REG | |      \    [A]  ||                | |
>              +-+----.  '---+-+  ||                | |
>              | |     \     | |<---------[0-a]----------
>              +-+-----------| |<----------[2]----------.
>      EDMA LL | |           | |  ||                | | :
>              | |           | |  ||                | | :
>              +-+-----------+-+  ||  [B]           | | :
>              | |                ||  ++            | | :
>           ---------[0-b]----------->||----------------'
>              | |            ++  ||  ||            | |
>              | |            ||  ||  ++            | |
>              | |            ||<----------[4]-----------
>              | |            ++  ||                | |
>              | |           [C]  ||                | |
>           .--|#|<------------------------[3]------|#|<-.
>           :  |#|                ||                |#|  :
>          [5] | |                ||                | | [1]
>           :  | |                ||                | |  :
>           '->|#|                                  |#|--'
>              |#|                                  |#|
>              | |                                  | |
>
> Figure 2. EP->RC traffic via ntb_netdev + ntb_transport_edma
> backed by ntb_edma/ntb_dw_edma
>
>              EP                                   RC
>           phys addr                            phys addr
>             space                                space
>              +-+                                  +-+
>              | |                                  | |
>              | |                ||                | |
>              +-+                ||                | |
>     EDMA REG | |                ||                | |
>              +-+                ||                | |
>    ^         | |                ||                | |
>    :         +-+                ||                | |
>    :  EDMA LL| |                ||                | |
>    :         | |                ||                | |
>    :         +-+                ||  [C]           | |
>    :         | |                ||  ++            | |
>    :      -----------[4]----------->||            | |
>    :         | |            ++  ||  ||            | |
>    :         | |            ||  ||  ++            | |
>    '----------------[2]-----||<--------[0-b]-----------
>              | |            ++  ||                | |
>              | |           [B]  ||                | |
>           .->|#|--------[3]---------------------->|#|--.
>           :  |#|                ||                |#|  :
>          [1] | |                ||                | | [5]
>           :  | |                ||                | |  :
>           '--|#|                                  |#|<-'
>              |#|                                  |#|
>              | |                                  | |
>
> 0-a. configure remote embedded DMA (program endpoint DMA registers)
> 0-b. DMA-map and publish destination address (DAR)
> 1. network stack builds skb (copy from application/user memory)
> 2. consume DAR, DMA-map source address (SAR) and kick DMA transfer
> 3. DMA transfer (payload moves between RC/EP memory)
> 4. consume completion (commit)
> 5. network stack delivers data to application/user memory
>
> [A]: Dedicated MW that aggregates DMA regs and LL (peer ioremaps it)
> [B]: Control-plane ring buffer for "produce"
> [C]: Control-plane ring buffer for "consume"
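
As a rough mental model of steps 0-b, 2, 3 and 4 above, the control-plane
handshake can be sketched as a small user-space ring in C. This is only an
illustration under stated assumptions: struct dar_ring, ring_publish() and
ring_send() are hypothetical names, memcpy() stands in for the eDMA
transfer, and none of this is taken from the ntb_transport_edma code.

```c
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4 /* illustrative size only */

/*
 * Hypothetical control-plane ring, loosely modeled on ring [B]:
 * the receiver publishes destination buffers, the sender consumes them.
 */
struct dar_ring {
	void *dst[RING_SLOTS];
	size_t len[RING_SLOTS];
	unsigned int head; /* next slot the receiver fills */
	unsigned int tail; /* next slot the sender consumes */
};

/* Step 0-b: publish a destination buffer. Returns 0, or -1 if full. */
static int ring_publish(struct dar_ring *r, void *dst, size_t len)
{
	if (r->head - r->tail == RING_SLOTS)
		return -1;
	r->dst[r->head % RING_SLOTS] = dst;
	r->len[r->head % RING_SLOTS] = len;
	r->head++;
	return 0;
}

/*
 * Steps 2-3: consume one published destination and move the payload.
 * Returns the number of bytes moved, or -1 if no destination has been
 * published or the payload does not fit.
 */
static long ring_send(struct dar_ring *r, const void *src, size_t len)
{
	unsigned int slot;

	if (r->head == r->tail)
		return -1;
	slot = r->tail % RING_SLOTS;
	if (len > r->len[slot])
		return -1;
	memcpy(r->dst[slot], src, len); /* stand-in for the DMA transfer */
	r->tail++; /* step 4 would report this completion via ring [C] */
	return (long)len;
}
```

The point of the sketch is that only addresses and lengths travel through
the ring, while the payload itself goes straight to the published buffer.
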
>
>
> Kernel base / dependencies
> ==========================
>
> This series is based on:
>
> - next-20260114 (commit b775e489bec7)
>
> plus the following seven unmerged patch series or standalone patches:
>
> - [PATCH v4 0/7] PCI: endpoint/NTB: Harden vNTB resource management
> https://lore.kernel.org/all/[email protected]/
>
> - [PATCH v2 0/2] NTB: ntb_transport: debugfs cleanups
> https://lore.kernel.org/all/[email protected]/
>
> - [PATCH v3 0/9] dmaengine: Add new API to combine configuration and
> descriptor preparation
> https://lore.kernel.org/all/[email protected]/
>
> - [PATCH v8 0/5] PCI: endpoint: BAR subrange mapping support
> https://lore.kernel.org/all/[email protected]/
>
> - [PATCH] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on
> mws_size[] access
> https://lore.kernel.org/all/[email protected]/
>
> - [PATCH] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr
> interrupts
> https://lore.kernel.org/all/[email protected]/
>
> - [PATCH v2 01/11] dmaengine: dw-edma: Add spinlock to protect
> DONE_INT_MASK and ABORT_INT_MASK
> https://lore.kernel.org/imx/[email protected]/
> (only this single commit is cherry-picked from the series)
>
>
> Patch layout
> ============
>
> 1. dw-edma / DesignWare EP helpers needed for remote embedded-DMA (export
> register/LL windows, IRQ routing control, etc.)
>
> Patch 01 : dmaengine: dw-edma: Export helper to get integrated register
> window
> Patch 02 : dmaengine: dw-edma: Add per-channel interrupt routing control
> Patch 03 : dmaengine: dw-edma: Poll completion when local IRQ handling
> is disabled
> Patch 04 : dmaengine: dw-edma: Add notify-only channels support
> Patch 05 : dmaengine: dw-edma: Add a helper to query linked-list region
>
> 2. NTB EPF/core + vNTB prep (mwN_offset + versioning, MSI vector
> management, new ntb_dev_ops helpers, driver_override, vntb glue)
>
> Patch 06 : NTB: epf: Add mwN_offset support and config region versioning
> Patch 07 : NTB: epf: Reserve a subset of MSI vectors for non-NTB users
> Patch 08 : NTB: epf: Provide db_vector_count/db_vector_mask callbacks
> Patch 09 : NTB: core: Add mw_set_trans_ranges() for subrange programming
> Patch 10 : NTB: core: Add .get_private_data() to ntb_dev_ops
> Patch 11 : NTB: core: Add .get_dma_dev() to ntb_dev_ops
> Patch 12 : NTB: core: Add driver_override support for NTB devices
> Patch 13 : PCI: endpoint: pci-epf-vntb: Support BAR subrange mappings
> for MWs
> Patch 14 : PCI: endpoint: pci-epf-vntb: Implement .get_private_data()
> callback
> Patch 15 : PCI: endpoint: pci-epf-vntb: Implement .get_dma_dev()
>
> 3. ntb_transport refactor/modularization and backend infrastructure
>
> Patch 16 : NTB: ntb_transport: Move TX memory window setup into
> setup_qp_mw()
> Patch 17 : NTB: ntb_transport: Dynamically determine qp count
> Patch 18 : NTB: ntb_transport: Use ntb_get_dma_dev()
> Patch 19 : NTB: ntb_transport: Rename ntb_transport.c to
> ntb_transport_core.c
> Patch 20 : NTB: ntb_transport: Move internal types to
> ntb_transport_internal.h
> Patch 21 : NTB: ntb_transport: Export common helpers for modularization
> Patch 22 : NTB: ntb_transport: Split core library and default NTB client
> Patch 23 : NTB: ntb_transport: Add transport backend infrastructure
> Patch 24 : NTB: ntb_transport: Run ntb_set_mw() before link-up
> negotiation
>
> 4. ntb_edma backend registry + DesignWare backend + transport client
>
> Patch 25 : NTB: hw: Add remote eDMA backend registry and DesignWare
> backend
> Patch 26 : NTB: ntb_transport: Add remote embedded-DMA transport client
>
> 5. ntb_netdev multi-queue support
>
> Patch 27 : ntb_netdev: Multi-queue support
>
> 6. Renesas R-Car S4 enablement (IOMMU, DTs, quirks)
>
> Patch 28 : iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
> Patch 29 : iommu: ipmmu-vmsa: Add support for reserved regions
> Patch 30 : arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote
> DW PCIe eDMA
> Patch 31 : NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B
> for R-Car)
> Patch 32 : NTB: epf: Add an additional memory window (MW2) barno mapping
> on Renesas R-Car
>
> 7. Documentation updates
>
> Patch 33 : Documentation: PCI: endpoint: pci-epf-vntb: Update and add
> mwN_offset usage
> Patch 34 : Documentation: driver-api: ntb: Document remote embedded-DMA
> transport
>
> 8. pci-epf-test / pci_endpoint_test / kselftest coverage for remote eDMA
>
> Patch 35 : PCI: endpoint: pci-epf-test: Add pci_epf_test_next_free_bar()
> helper
> Patch 36 : PCI: endpoint: pci-epf-test: Add remote eDMA-backed mode
> Patch 37 : misc: pci_endpoint_test: Add remote eDMA transfer test mode
> Patch 38 : selftests: pci_endpoint: Add remote eDMA transfer coverage
>
>
> Tested on
> =========
>
> * 2x Renesas R-Car S4 Spider (RC<->EP connected with OCuLink cable)
> * Kernel base as described above
>
>
> Performance notes
> =================
>
> The primary motivation remains improving throughput/latency for ntb_transport
> users (typically ntb_netdev). On R-Car S4, the earlier prototype (RFC v3)
> showed roughly 10-20x throughput improvement in preliminary iperf3 tests and
> lower ping RTT. I have not yet re-measured after the v4 refactor and
> module split.
>
>
> Changelog
> =========
>
> RFCv3->RFCv4 changes:
> - Major refactor of the transport layering:
> - Introduce ntb_transport_core as a shared library module.
> - Split the legacy shared-memory transport client (ntb_transport) and the
> remote embedded-DMA transport client (ntb_transport_edma).
> - Add driver_override support for ntb_bus and use it for per-port
> transport selection.
> - Introduce a vendor-agnostic remote embedded-DMA backend registry
> (ntb_edma) and add the initial DesignWare backend (ntb_dw_edma).
> - Rebase to next-20260114 and move several prerequisite/fixup patchsets
> into separate threads (listed above), including BAR subrange mapping
> support and dw-edma fixes.
> - Add PCI endpoint test coverage for the remote embedded-DMA path:
> - extend pci-epf-test / pci_endpoint_test
> - add a kselftest variant to exercise remote-eDMA transfers
> Note: to keep the changes as small as possible, I added a few #ifdefs
> in the main test code. Feedback on whether/how/to what extent this
> should be split into separate modules would be appreciated.
> - Expand documentation (Documentation/driver-api/ntb.rst) to describe
> transport variants, the new module structure, and the remote
> embedded-DMA data flow.
> - Address other feedback from the RFC v3 thread.
>
> RFCv2->RFCv3 changes:
> - Architecture
> - Have EP side use its local write channels, while leaving RC side to
> use remote read channels.
> - Improved abstraction and encapsulation of HW-specific code.
> - Added control/config region versioning for the vNTB/EPF control region
> so that mismatched RC/EP kernels fail early instead of silently using an
> incompatible layout.
> - Reworked BAR subrange / multi-region mapping support:
> - Dropped the v2 approach that added new inbound mapping ops in the EPC
> core.
> - Introduced `struct pci_epf_bar.submap` and extended DesignWare EP to
> support BAR subrange inbound mapping via Address Match Mode IB iATU.
> - pci-epf-vntb now provides a subrange mapping hint to the EPC driver
> when offsets are used.
> - Changed .get_pci_epc() to .get_private_data()
> - Dropped two commits from RFC v2 that should be submitted separately:
> (1) ntb_transport debugfs seq_file conversion
> (2) DWC EP outbound iATU MSI mapping/cache fix (will be re-posted
> separately)
> - Added documentation updates.
> - Addressed assorted review nits from the RFC v2 thread (naming/structure).
>
> RFCv1->RFCv2 changes:
> - Architecture
> - Drop the generic interrupt backend + DW eDMA test-interrupt backend
> approach and instead adopt the remote eDMA-backed ntb_transport mode
> proposed by Frank Li. The BAR-sharing / mwN_offset / inbound
> mapping (Address Match Mode) infrastructure from RFC v1 is largely
> kept, with only minor refinements and code motion where necessary
> to fit the new transport-mode design.
> - For Patch 01
> - Rework the array_index_nospec() conversion to address review
> comments on "[RFC PATCH 01/25]".
>
> RFCv3: https://lore.kernel.org/all/[email protected]/
> RFCv2: https://lore.kernel.org/all/[email protected]/
> RFCv1: https://lore.kernel.org/all/[email protected]/
>
> Thank you for reviewing,
>
>
> Koichiro Den (38):
> dmaengine: dw-edma: Export helper to get integrated register window
> dmaengine: dw-edma: Add per-channel interrupt routing control
> dmaengine: dw-edma: Poll completion when local IRQ handling is
> disabled
> dmaengine: dw-edma: Add notify-only channels support
> dmaengine: dw-edma: Add a helper to query linked-list region
> NTB: epf: Add mwN_offset support and config region versioning
> NTB: epf: Reserve a subset of MSI vectors for non-NTB users
> NTB: epf: Provide db_vector_count/db_vector_mask callbacks
> NTB: core: Add mw_set_trans_ranges() for subrange programming
> NTB: core: Add .get_private_data() to ntb_dev_ops
> NTB: core: Add .get_dma_dev() to ntb_dev_ops
> NTB: core: Add driver_override support for NTB devices
> PCI: endpoint: pci-epf-vntb: Support BAR subrange mappings for MWs
> PCI: endpoint: pci-epf-vntb: Implement .get_private_data() callback
> PCI: endpoint: pci-epf-vntb: Implement .get_dma_dev()
> NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
> NTB: ntb_transport: Dynamically determine qp count
> NTB: ntb_transport: Use ntb_get_dma_dev()
> NTB: ntb_transport: Rename ntb_transport.c to ntb_transport_core.c
> NTB: ntb_transport: Move internal types to ntb_transport_internal.h
> NTB: ntb_transport: Export common helpers for modularization
> NTB: ntb_transport: Split core library and default NTB client
> NTB: ntb_transport: Add transport backend infrastructure
> NTB: ntb_transport: Run ntb_set_mw() before link-up negotiation
> NTB: hw: Add remote eDMA backend registry and DesignWare backend
> NTB: ntb_transport: Add remote embedded-DMA transport client
> ntb_netdev: Multi-queue support
> iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
> iommu: ipmmu-vmsa: Add support for reserved regions
> arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
> eDMA
> NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
> NTB: epf: Add an additional memory window (MW2) barno mapping on
> Renesas R-Car
> Documentation: PCI: endpoint: pci-epf-vntb: Update and add mwN_offset
> usage
> Documentation: driver-api: ntb: Document remote embedded-DMA transport
> PCI: endpoint: pci-epf-test: Add pci_epf_test_next_free_bar() helper
> PCI: endpoint: pci-epf-test: Add remote eDMA-backed mode
> misc: pci_endpoint_test: Add remote eDMA transfer test mode
> selftests: pci_endpoint: Add remote eDMA transfer coverage
>
> Documentation/PCI/endpoint/pci-vntb-howto.rst | 19 +-
> Documentation/driver-api/ntb.rst | 193 ++
> arch/arm64/boot/dts/renesas/Makefile | 2 +
> .../boot/dts/renesas/r8a779f0-spider-ep.dts | 37 +
> .../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +
> drivers/dma/dw-edma/dw-edma-core.c | 207 +-
> drivers/dma/dw-edma/dw-edma-core.h | 10 +
> drivers/dma/dw-edma/dw-edma-v0-core.c | 26 +-
> drivers/iommu/ipmmu-vmsa.c | 7 +-
> drivers/misc/pci_endpoint_test.c | 633 +++++
> drivers/net/ntb_netdev.c | 341 ++-
> drivers/ntb/Kconfig | 13 +
> drivers/ntb/Makefile | 2 +
> drivers/ntb/core.c | 68 +
> drivers/ntb/hw/Kconfig | 1 +
> drivers/ntb/hw/Makefile | 1 +
> drivers/ntb/hw/edma/Kconfig | 28 +
> drivers/ntb/hw/edma/Makefile | 5 +
> drivers/ntb/hw/edma/backend.c | 87 +
> drivers/ntb/hw/edma/backend.h | 102 +
> drivers/ntb/hw/edma/ntb_dw_edma.c | 977 +++++++
> drivers/ntb/hw/epf/ntb_hw_epf.c | 199 +-
> drivers/ntb/ntb_transport.c | 2458 +---------------
> drivers/ntb/ntb_transport_core.c | 2523 +++++++++++++++++
> drivers/ntb/ntb_transport_edma.c | 1110 ++++++++
> drivers/ntb/ntb_transport_internal.h | 261 ++
> drivers/pci/controller/dwc/pcie-designware.c | 26 +
> drivers/pci/endpoint/functions/pci-epf-test.c | 497 +++-
> drivers/pci/endpoint/functions/pci-epf-vntb.c | 380 ++-
> include/linux/dma/edma.h | 106 +
> include/linux/ntb.h | 88 +
> include/uapi/linux/pcitest.h | 3 +-
> .../pci_endpoint/pci_endpoint_test.c | 17 +
> 33 files changed, 7855 insertions(+), 2624 deletions(-)
> create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
> create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
> create mode 100644 drivers/ntb/hw/edma/Kconfig
> create mode 100644 drivers/ntb/hw/edma/Makefile
> create mode 100644 drivers/ntb/hw/edma/backend.c
> create mode 100644 drivers/ntb/hw/edma/backend.h
> create mode 100644 drivers/ntb/hw/edma/ntb_dw_edma.c
> create mode 100644 drivers/ntb/ntb_transport_core.c
> create mode 100644 drivers/ntb/ntb_transport_edma.c
> create mode 100644 drivers/ntb/ntb_transport_internal.h
>