> > > > > > > > > > > > > > In current DPDK framework, each RX queue is > > > > > > > > > > > > > > pre-loaded with mbufs > > > > > > > > > > > > > > for incoming packets. When number of > > > > > > > > > > > > > > representors scale out in a > > > > > > > > > > > > > > switch domain, the memory consumption became > > > > > > > > > > > > > > significant. Most > > > > > > > > > > > > > > important, polling all ports leads to high > > > > > > > > > > > > > > cache miss, high > > > > > > > > > > > > > > latency and low throughput. > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports > > > > > > > > > > > > > > with same > > > > > > > > > > > > > > configuration in a switch domain could share > > > > > > > > > > > > > > RX queue set by specifying sharing group. > > > > > > > > > > > > > > Polling any queue using same shared RX queue > > > > > > > > > > > > > > receives packets from > > > > > > > > > > > > > > all member ports. Source port is identified > > > > > > > > > > > > > > by mbuf->port. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Port queue number in a shared group should be > > > > > > > > > > > > > > identical. Queue > > > > > > > > > > > > > > index is > > > > > > > > > > > > > > 1:1 mapped in shared group. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on > > > > > > > > > > > > > > same thread. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID. > > > > > > > > > > > > > > > > > > > > > > > > > > Is this offload specific to the representor? If > > > > > > > > > > > > > so can this name be changed specifically to > > > > > > > > > > > > > representor? > > > > > > > > > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could > > > > > > > > > > > > take advantage. > > > > > > > > > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow > > > > > > > > > > > > > ordering will be maintained? > > > > > > > > > > > > > > > > > > > > > > > > Not quite sure that I understood your question. > > > > > > > > > > > > The control path of is > > > > > > > > > > > > almost same as before, PF and representor port > > > > > > > > > > > > still needed, rte flows not impacted. > > > > > > > > > > > > Queues still needed for each member port, > > > > > > > > > > > > descriptors(mbuf) will be > > > > > > > > > > > > supplied from shared Rx queue in my PMD > > > > > > > > > > > > implementation. > > > > > > > > > > > > > > > > > > > > > > My question was if create a generic > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple > > > > > > > > > > > ethdev receive queues land into > > > the same > > > > > > > > > > > receive queue, In that case, how the flow order is > > > > > > > > > > > maintained for respective receive queues. > > > > > > > > > > > > > > > > > > > > I guess the question is testpmd forward stream? The > > > > > > > > > > forwarding logic has to be changed slightly in case > > > > > > > > > > of shared rxq. > > > > > > > > > > basically for each packet in rx_burst result, lookup > > > > > > > > > > source stream according to mbuf->port, forwarding to > > > > > > > > > > target fs. > > > > > > > > > > Packets from same source port could be grouped as a > > > > > > > > > > small burst to process, this will accelerates the > > > > > > > > > > performance if traffic > > > come from > > > > > > > > > > limited ports. I'll introduce some common api to do > > > > > > > > > > shard rxq forwarding, call it with packets handling > > > > > > > > > > callback, so it suites for > > > > > > > > > > all forwarding engine. Will sent patches soon. > > > > > > > > > > > > > > > > > > > > > > > > > > > > All ports will put the packets in to the same queue > > > > > > > > > (share queue), right? Does > > > > > > > > > this means only single core will poll only, what will > > > > > > > > > happen if there are > > > > > > > > > multiple cores polling, won't it cause problem? > > > > > > > > > > > > > > > > > > And if this requires specific changes in the > > > > > > > > > application, I am not sure about > > > > > > > > > the solution, can't this work in a transparent way to > > > > > > > > > the application? > > > > > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that > > > > > > > > aggregate ports > > > > > > > > in same group into one new port. Users could schedule > > > > > > > > polling on the > > > > > > > > aggregated port instead of all member ports. > > > > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO, > > > > > > > For this > > > > > > > feature, we should not change fastpath of testpmd > > > > > > > application. Instead, testpmd can use aggregated ports > > > > > > > probably as > > > > > > > separate fwd_engine to show how to use this feature. > > > > > > > > > > > > Good point to discuss :) There are two strategies to polling > > > > > > a shared > > > > > > Rxq: > > > > > > 1. polling each member port > > > > > > All forwarding engines can be reused to work as before. > > > > > > My testpmd patches are efforts towards this direction. > > > > > > Does your PMD support this? > > > > > > > > > > Not unfortunately. More than that, every application needs to > > > > > change > > > > > to support this model. > > > > > > > > Both strategies need user application to resolve port ID from > > > > mbuf and > > > > process accordingly. > > > > This one doesn't demand aggregated port, no polling schedule > > > > change. > > > > > > I was thinking, mbuf will be updated from driver/aggregator port as > > > when it > > > comes to application. > > > > > > > > > > > > > > > > > > 2. polling aggregated port > > > > > > Besides forwarding engine, need more work to to demo it. > > > > > > This is an optional API, not supported by my PMD yet. > > > > > > > > > > We are thinking of implementing this PMD when it comes to it, > > > > > ie. > > > > > without application change in fastpath > > > > > logic. > > > > > > > > Fastpath have to resolve port ID anyway and forwarding according > > > > to > > > > logic. Forwarding engine need to adapt to support shard Rxq. > > > > Fortunately, in testpmd, this can be done with an abstract API. > > > > > > > > Let's defer part 2 until some PMD really support it and tested, > > > > how do > > > > you think? > > > > > > We are not planning to use this feature so either way it is OK to > > > me. > > > I leave to ethdev maintainers decide between 1 vs 2. > > > > > > I do have a strong opinion not changing the testpmd basic forward > > > engines > > > for this feature.I would like to keep it simple as fastpath > > > optimized and would > > > like to add a separate Forwarding engine as means to verify this > > > feature. > > > > +1 to that. > > I don't think it a 'common' feature. > > So separate FWD mode seems like a best choice to me. > > -1 :) > There was some internal requirement from test team, they need to verify > all features like packet content, rss, vlan, checksum, rte_flow... to > be working based on shared rx queue.
Then I suppose you'll need to write really comprehensive fwd-engine to satisfy your test team :) Speaking seriously, I still don't understand why do you need all available fwd-engines to verify this feature. From what I understand, main purpose of your changes to test-pmd: allow to fwd packet though different fwd_stream (TX through different HW queue). In theory, if implemented in generic and extendable way - that might be a useful add-on to tespmd fwd functionality. But current implementation looks very case specific. And as I don't think it is a common case, I don't see much point to pollute basic fwd cases with it. BTW, as a side note, the code below looks bogus to me: +void +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx, + struct rte_mbuf **pkts_burst, packet_fwd_cb fwd) +{ + uint16_t i, nb_fs_rx = 1, port; + + /* Locate real source fs according to mbuf->port. */ + for (i = 0; i < nb_rx; ++i) { + rte_prefetch0(pkts_burst[i + 1]); you access pkt_burst[] beyond array boundaries, also you ask cpu to prefetch some unknown and possibly invalid address. > Based on the patch, I believe the > impact has been minimized. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Overall, is this for optimizing memory for the port > > > > > > > > > represontors? If so can't we > > > > > > > > > have a port representor specific solution, reducing > > > > > > > > > scope can reduce the > > > > > > > > > complexity it brings? > > > > > > > > > > > > > > > > > > > > If this offload is only useful for representor > > > > > > > > > > > case, Can we make this offload specific to > > > > > > > > > > > representor the case by changing its > > > name and > > > > > > > > > > > scope. > > > > > > > > > > > > > > > > > > > > It works for both PF and representors in same switch > > > > > > > > > > domain, for application like OVS, few changes to > > > > > > > > > > apply. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li > > > > > > > > > > > > > > <xuemi...@nvidia.com> > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > doc/guides/nics/features.rst > > > > > > > > > > > > > > | 11 +++++++++++ > > > > > > > > > > > > > > doc/guides/nics/features/default.ini > > > > > > > > > > > > > > | 1 + > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation. > > > > > > > > > > > > > > rst | 10 ++++++++++ > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c > > > > > > > > > > > > > > | 1 + > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h > > > > > > > > > > > > > > | 7 +++++++ > > > > > > > > > > > > > > 5 files changed, 30 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644 > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 > > > > > > > > > > > > > > checksum. > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue: > > > > > > > > > > > > > > + > > > > > > > > > > > > > > +Shared Rx queue > > > > > > > > > > > > > > +--------------- > > > > > > > > > > > > > > + > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same > > > > > > > > > > > > > > switch domain. > > > > > > > > > > > > > > + > > > > > > > > > > > > > > +* **[uses] > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**: > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``. > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``. > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Packet type parsing > > > > > > > > > > > > > > diff --git > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644 > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand = > > > > > > > > > > > > > > Queue start/stop = > > > > > > > > > > > > > > Runtime Rx queue setup = > > > > > > > > > > > > > > Runtime Tx queue setup = > > > > > > > > > > > > > > +Shared Rx queue = > > > > > > > > > > > > > > Burst mode info = > > > > > > > > > > > > > > Power mgmt address monitor = > > > > > > > > > > > > > > MTU update = > > > > > > > > > > > > > > diff --git > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation > > > > > > > > > > > > > > .rst > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation > > > > > > > > > > > > > > .rst > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644 > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation > > > > > > > > > > > > > > .rst > > > > > > > > > > > > > > +++ > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation > > > > > > > > > > > > > > .rst > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software > > > > > > > > > > > > > > "patch panel" front-end for applications. > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model > > > > > > > > > > > > > > (switchdev) > > > > > > > > > > > > > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net > > > > > > > > > > > > > > working/switchdev.txt > > > > > > > > > > > > > > > `_ > > > > > > > > > > > > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when > > > > > > > > > > > > > > number of representor > > > > > > > > > > > > > > +grows, > > > > > > > > > > > > > > + because PMD always allocate mbuf for each > > > > > > > > > > > > > > descriptor of Rx queue. > > > > > > > > > > > > > > + Polling the large number of ports brings > > > > > > > > > > > > > > more CPU load, cache > > > > > > > > > > > > > > +miss and > > > > > > > > > > > > > > + latency. Shared Rx queue can be used to > > > > > > > > > > > > > > share Rx queue between > > > > > > > > > > > > > > +PF and > > > > > > > > > > > > > > + representors in same switch domain. > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` > > > > > > > > > > > > > > + is present in Rx offloading capability of > > > > > > > > > > > > > > device info. Setting > > > > > > > > > > > > > > +the > > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx > > > > > > > > > > > > > > queue configuration to > > > > > > > > > > > > > > +enable > > > > > > > > > > > > > > + shared Rx queue. Polling any member port > > > > > > > > > > > > > > of shared Rx queue can > > > > > > > > > > > > > > +return > > > > > > > > > > > > > > + packets of all ports in group, port ID is > > > > > > > > > > > > > > saved in ``mbuf.port``. > > > > > > > > > > > > > > + > > > > > > > > > > > > > > Basic SR-IOV > > > > > > > > > > > > > > ------------ > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644 > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct { > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU > > > > > > > > > > > > > > M), > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH), > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL > > > > > > > > > > > > > > IT), > > > > > > > > > > > > > > + > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ), > > > > > > > > > > > > > > }; > > > > > > > > > > > > > > > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644 > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf { > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets > > > > > > > > > > > > > > if no descriptors are available. */ > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */ > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of > > > > > > > > > > > > > > descriptions in rx_seg array. > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared > > > > > > > > > > > > > > port group index in > > > > > > > > > > > > > > + switch domain. */ > > > > > > > > > > > > > > /** > > > > > > > > > > > > > > * Per-queue Rx offloads to be set > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags. > > > > > > > > > > > > > > * Only offloads set on > > > > > > > > > > > > > > rx_queue_offload_capa or > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct > > > > > > > > > > > > > > rte_eth_conf { > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM > > > > > > > > > > > > > > 0x00040000 > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH > > > > > > > > > > > > > > 0x00080000 > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT > > > > > > > > > > > > > > 0x00100000 > > > > > > > > > > > > > > +/** > > > > > > > > > > > > > > + * Rx queue is shared among ports in same > > > > > > > > > > > > > > switch domain to save > > > > > > > > > > > > > > +memory, > > > > > > > > > > > > > > + * avoid polling each port. Any port in > > > > > > > > > > > > > > group can be used to receive packets. > > > > > > > > > > > > > > + * Real source port number saved in mbuf- > > > > > > > > > > > > > > >port field. > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ > > > > > > > > > > > > > > 0x00200000 > > > > > > > > > > > > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \ > > > > > > > > > > > > > > DEV_RX_OFFLO > > > > > > > > > > > > > > AD_UDP_CKSUM | \ > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > 2.25.1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >