> From: Feifei Wang [mailto:feifei.wa...@arm.com]
> Sent: Wednesday, 4 January 2023 08.31
> 
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of loads from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache, i.e. the receive side will free the buffers from the
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
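
To make the quoted arithmetic concrete, here is a minimal sketch that tallies
the mbuf-pointer traffic per burst for both paths. It is only a model of the
numbers in the cover letter; the struct and function names are made up for
illustration and are not part of DPDK or of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of mbuf-pointer traffic for one burst (illustrative model only). */
struct burst_traffic {
	size_t cache_stores;   /* Tx free: pointers written to the lcore cache */
	size_t cache_loads;    /* Rx rearm: pointers read back from the lcore cache */
	size_t sw_ring_stores; /* Rx rearm: pointers written to the Rx sw ring */
};

/* Current path: Tx frees into the lcore cache, Rx allocates from it. */
static struct burst_traffic
traffic_via_lcore_cache(size_t nb_bufs, size_t ptr_size)
{
	struct burst_traffic t;

	assert(ptr_size > 0);
	t.cache_stores = nb_bufs * ptr_size;
	t.cache_loads = nb_bufs * ptr_size;
	t.sw_ring_stores = nb_bufs * ptr_size;
	return t;
}

/* Direct rearm: freed Tx pointers go straight into the Rx sw ring. */
static struct burst_traffic
traffic_direct_rearm(size_t nb_bufs, size_t ptr_size)
{
	struct burst_traffic t;

	assert(ptr_size > 0);
	t.cache_stores = 0;
	t.cache_loads = 0;
	t.sw_ring_stores = nb_bufs * ptr_size;
	return t;
}
```

With nb_bufs = 32 and ptr_size = 8, the first function reports 256B for each
of the three fields, and the second reports 256B of sw ring stores only.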

I am starting to wonder whether we have been adding unnecessary feature
creep by trying to make this feature too generic.

Could you please describe some of the most important high-volume use cases
from real life? It would help to set the scope correctly.

> 
> However, this solution poses several constraints:
> 
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
> 
> If the NIC has 2 ports (there are several references), then we will have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. Moreover, the PMD would have to scan a list of transmit queues to
> pull the buffers from.
> 
> 2)The other factor that needs to be considered is 'run-to-completion' vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the pipeline
> model, the receive side and transmit side might be running on different
> lcores in parallel. This requires locking, which is not supported at this
> point.
> 
> 3)Tx and Rx buffers must be from the same mempool. We must also
> ensure that the number of buffers freed by Tx equals the number of
> buffers rearmed by Rx, so that 'tx_next_dd' can be updated correctly in
> direct-rearm mode. This is because 'tx_next_dd' is the variable used to
> compute the free location in the Tx sw-ring. Its value is one round
> ahead of the position where the next free starts.
> 
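
For constraint 3, the following is a rough sketch of how I read the
'tx_next_dd' bookkeeping, assuming "one round" means tx_rs_thresh - 1 entries
past the start of the batch, and assuming the ring size is a multiple of
tx_rs_thresh. The struct is a simplified, hypothetical stand-in for the
driver's Tx queue, and the functions are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified Tx queue state (not the driver's real struct). */
struct txq_model {
	uint16_t nb_tx_desc;   /* ring size, assumed a multiple of tx_rs_thresh */
	uint16_t tx_rs_thresh; /* buffers freed per batch, e.g. 32 */
	uint16_t tx_next_dd;   /* last descriptor of the next batch to free */
	uint16_t nb_tx_free;   /* free descriptors in the ring */
};

/* First sw-ring index of the batch that tx_next_dd points to the end of. */
static uint16_t
txq_free_start(const struct txq_model *txq)
{
	return (uint16_t)(txq->tx_next_dd - (txq->tx_rs_thresh - 1));
}

/*
 * Advance the queue state after a batch has been handed to Rx.
 * Returns 1 on success. Returns 0 and leaves the state untouched when the
 * number of rearmed Rx buffers differs from the Tx batch size, since
 * tx_next_dd would otherwise no longer point at a batch boundary.
 */
static int
txq_advance(struct txq_model *txq, uint16_t nb_rearmed)
{
	assert(txq->tx_rs_thresh != 0);
	assert(txq->nb_tx_desc % txq->tx_rs_thresh == 0);

	if (nb_rearmed != txq->tx_rs_thresh)
		return 0;

	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
	if (txq->tx_next_dd >= txq->nb_tx_desc)
		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1); /* wrap */
	return 1;
}
```

In this model, a partial rearm (say 16 of 32 buffers) is simply refused, which
is one way to see why the Tx and Rx counts have to match.
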
> Current status in this patch:
> 1)Two APIs are added for users to enable direct-rearm mode:
>   In the control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
>   to get the Rx sw_ring pointer and its rxq_info.
>   (This avoids Tx loading Rx data directly.)
> 
>   In the data plane, users can call 'rte_eth_dev_direct_rearm' to rearm Rx
>   buffers and free Tx buffers at the same time. Internally, this API is
>   split into two separate APIs for Rx and Tx.
>   For Tx, 'rte_eth_tx_fill_sw_ring' fills a given sw_ring with the
>   buffers freed by Tx.
>   For Rx, 'rte_eth_rx_flush_descriptor' flushes its descriptors based
>   on the rearmed buffers.
>   This separates the Rx and Tx operations, so a user can even rearm an
>   Rx queue not only from the same driver's Tx queue, but from other
>   sources too.
> -----------------------------------------------------------------------
>   control plane:
>       rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
>   data plane:
>       loop {
>               rte_eth_dev_direct_rearm(*rxq_rearm_data){
> 
>                       rte_eth_tx_fill_sw_ring{
>                               for (i = 0; i < 32; i++) {
>                                       sw_ring.mbuf[i] = tx.mbuf[i];
>                               }
>                       }
> 
>                       rte_eth_rx_flush_descriptor{
>                               for (i = 0; i < 32; i++) {
>                                       flush descs[i];
>                               }
>                       }
>               }
>               rte_eth_rx_burst;
>               rte_eth_tx_burst;
>       }
> -----------------------------------------------------------------------
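
The two-step flow in the pseudocode above can be modelled in plain C roughly
as follows. This is only a sketch under my own assumptions: the types and
function names are hypothetical stand-ins, not the signatures proposed in the
patch, and the real drivers work on rte_mbuf and hardware descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REARM_BURST 32

/* Hypothetical stand-ins for struct rte_mbuf and an Rx descriptor. */
struct mbuf_model { uint64_t buf_iova; };
struct rx_desc_model { uint64_t pkt_addr; };

/* Tx step: move freed mbuf pointers straight into the Rx sw ring. */
static uint16_t
tx_fill_sw_ring(struct mbuf_model **tx_sw_ring,
		struct mbuf_model **rx_sw_ring, uint16_t n)
{
	uint16_t i;

	assert(n <= REARM_BURST);
	for (i = 0; i < n; i++) {
		rx_sw_ring[i] = tx_sw_ring[i];
		tx_sw_ring[i] = NULL; /* Tx no longer owns the buffer */
	}
	return n;
}

/* Rx step: write the rearmed buffers' addresses into the descriptors. */
static void
rx_flush_descriptor(struct rx_desc_model *descs,
		struct mbuf_model **rx_sw_ring, uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++)
		descs[i].pkt_addr = rx_sw_ring[i]->buf_iova;
}

/* Combined data-plane call: Tx fill followed by Rx descriptor flush. */
static uint16_t
direct_rearm(struct mbuf_model **tx_sw_ring, struct mbuf_model **rx_sw_ring,
		struct rx_desc_model *descs)
{
	uint16_t n = tx_fill_sw_ring(tx_sw_ring, rx_sw_ring, REARM_BURST);

	rx_flush_descriptor(descs, rx_sw_ring, n);
	return n;
}
```

Keeping the two steps as separate functions is what allows the Rx flush to be
fed from something other than the same driver's Tx queue.
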
> 2)The i40e driver is changed to do the direct re-arm of the receive
>   side.
> 3)The ixgbe driver is changed to do the direct re-arm of the receive
>   side.
> 
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
>     port 0: 82599 NIC   port 1: XL710 NIC
> -------------------------------------------------------------
>                 Without fast free       With fast free
> ThunderX2:      +9.44%                  +7.14%
> -------------------------------------------------------------
> 
> (2) dpdk l3fwd test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> *Direct rearm with exposing rx_sw_ring:
>                 Without fast free       With fast free
> Ampere Altra:   +14.98%                 +15.77%
> N1SDP:          +6.47%                  +0.52%
> -------------------------------------------------------------
> 
> (3) VPP test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> *Direct rearm with exposing rx_sw_ring:
> Ampere Altra:   +4.59%
> N1SDP:          +5.4%
> -------------------------------------------------------------
> 
> Reference:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
> 
> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> 
> V3:
> 1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete l3fwd change for direct rearm (Jerin)
> 3. Enable direct rearm in ixgbe driver on Arm
> 
> Feifei Wang (3):
>   ethdev: enable direct rearm with separate API
>   net/i40e: enable direct rearm with separate API
>   net/ixgbe: enable direct rearm with separate API
> 
>  drivers/net/i40e/i40e_ethdev.c            |   1 +
>  drivers/net/i40e/i40e_ethdev.h            |   2 +
>  drivers/net/i40e/i40e_rxtx.c              |  19 +++
>  drivers/net/i40e/i40e_rxtx.h              |   4 +
>  drivers/net/i40e/i40e_rxtx_vec_common.h   |  54 +++++++
>  drivers/net/i40e/i40e_rxtx_vec_neon.c     |  42 ++++++
>  drivers/net/ixgbe/ixgbe_ethdev.c          |   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.h          |   3 +
>  drivers/net/ixgbe/ixgbe_rxtx.c            |  19 +++
>  drivers/net/ixgbe/ixgbe_rxtx.h            |   4 +
>  drivers/net/ixgbe/ixgbe_rxtx_vec_common.h |  48 ++++++
>  drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c   |  52 +++++++
>  lib/ethdev/ethdev_driver.h                |  10 ++
>  lib/ethdev/ethdev_private.c               |   2 +
>  lib/ethdev/rte_ethdev.c                   |  52 +++++++
>  lib/ethdev/rte_ethdev.h                   | 174 ++++++++++++++++++++++
>  lib/ethdev/rte_ethdev_core.h              |  11 ++
>  lib/ethdev/version.map                    |   6 +
>  18 files changed, 504 insertions(+)
> 
> --
> 2.25.1
> 
