> From: Konstantin Ananyev [mailto:konstantin.v.anan...@yandex.ru]
> Sent: Saturday, 4 June 2022 01.32
> 
> > <snip>
> >>>
> >>>>
> >>>> [konstantin.v.anan...@yandex.ru appears similar to someone who
> >>>> previously sent you email, but may not be that person. Learn why
> this
> >>>> could be a risk at https://aka.ms/LearnAboutSenderIdentification.]
> >>>>
> >>>> 16/05/2022 07:10, Feifei Wang пишет:
> >>>>>
> >>>>>>> Currently, the transmit side frees the buffers into the lcore
> >>>>>>> cache and the receive side allocates buffers from the lcore
> cache.
> >>>>>>> The transmit side typically frees 32 buffers resulting in
> >>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates
> 32
> >>>>>>> buffers and stores them in the receive side software ring,
> >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> lcore cache.
> >>>>>>>
> >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> >>>>>>> from the lcore cache. i.e. the receive side will free the
> buffers
> >>>>>>> from transmit side directly into it's software ring. This will
> >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> cache.
> >>>>>>> It also frees up the cache lines used by the lcore cache.
> >>>>>>>
> >>>>>>> However, this solution poses several constraints:
> >>>>>>>
> >>>>>>> 1)The receive queue needs to know which transmit queue it
> should
> >>>>>>> take the buffers from. The application logic decides which
> >>>>>>> transmit port to use to send out the packets. In many use cases
> >>>>>>> the NIC might have a single port ([1], [2], [3]), in which case
> a
> >>>>>>> given transmit queue is always mapped to a single receive queue
> >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> >>>>>>>
> >>>>>>> If the NIC has 2 ports (there are several references), then we
> >>>>>>> will have
> >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> configure.
> >>>>>>> However, if this is generalized to 'N' ports, the configuration
> >>>>>>> can be long. More over the PMD would have to scan a list of
> >>>>>>> transmit queues to pull the buffers from.
> >>>>>
> >>>>>> Just to re-iterate some generic concerns about this proposal:
> >>>>>>     - We effectively link RX and TX queues - when this feature
> is enabled,
> >>>>>>       user can't stop TX queue without stopping linked RX queue
> first.
> >>>>>>       Right now user is free to start/stop any queues at his
> will.
> >>>>>>       If that feature will allow to link queues from different
> ports,
> >>>>>>       then even ports will become dependent and user will have
> to pay extra
> >>>>>>       care when managing such ports.
> >>>>>
> >>>>> [Feifei] When direct rearm enabled, there are two path for thread
> to
> >>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
> >>>>> from Tx.
> >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users
> do
> >>>>> not need to pay much attention managing ports.
> >>>>
> >>>> What I am talking about: right now different port or different
> queues
> >>>> of the same port can be treated as independent entities:
> >>>> in general user is free to start/stop (and even reconfigure in
> some
> >>>> cases) one entity without need to stop other entity.
> >>>> I.E user can stop and re-configure TX queue while keep receiving
> >>>> packets from RX queue.
> >>>> With direct re-arm enabled, I think it wouldn't be possible any
> more:
> >>>> before stopping/reconfiguring TX queue user would have make sure
> that
> >>>> corresponding RX queue wouldn't be used by datapath.
> >>> I am trying to understand the problem better. For the TX queue to
> be stopped,
> >> the user must have blocked the data plane from accessing the TX
> queue.
> >>
> >> Surely it is user responsibility tnot to call tx_burst() for
> stopped/released queue.
> >> The problem is that while TX for that queue is stopped, RX for
> related queue still
> >> can continue.
> >> So rx_burst() will try to read/modify TX queue data, that might be
> already freed,
> >> or simultaneously modified by control path.
> > Understood, agree on the issue
> >
> >>
> >> Again, it all can be mitigated by carefully re-designing and
> modifying control and
> >> data-path inside user app - by doing extra checks and
> synchronizations, etc.
> >> But from practical point - I presume most of users simply would
> avoid using this
> >> feature due all potential problems it might cause.
> > That is subjective, it all depends on the performance improvements
> users see in their application.
> > IMO, the performance improvement seen with this patch is worth few
> changes.
> 
> Yes, it is subjective till some extent, though my feeling
> that it might end-up being sort of synthetic improvement used only
> by some show-case benchmarks.

I believe that one specific important use case has already been mentioned, so I 
don't think this is a benchmark only feature.

>  From my perspective, it would be much more plausible,
> if we can introduce some sort of generic improvement, that doesn't
> impose all these extra constraints and implications.
> Like one, discussed below in that thread with ZC mempool approach.
> 

Considering this feature from a high level perspective, I agree with 
Konstantin's concerns, so I'll also support his views.

If this patch is supposed to be a generic feature, please add support for it in 
all NIC PMDs, not just one. (Regardless if the feature is defined as 1:1 
mapping or N:M mapping.) It is purely software, so it should be available for 
all PMDs, not just your favorite hardware! Consider the "fast mbuf free" 
feature, which is pure software; why is that feature not implemented in all 
PMDs?

A secondary point I'm making here is that this specific feature will lead to an 
enormous amount of copy-paste code, instead of a generic library function 
easily available for all PMDs.

> 
> >>
> >>> Like Feifei says, the RX side has the normal packet allocation path
> still available.
> >>> Also this sounds like a corner case to me, we can handle this
> through checks in
> >> the queue_stop API.
> >>
> >> Depends.
> >> if it would be allowed to link queues only from the same port, then
> yes, extra
> >> checks for queue-stop might be enough.
> >> As right now DPDK doesn't allow user to change number of queues
> without
> >> dev_stop() first.
> >> Though if it would be allowed to link queues from different ports,
> then situation
> >> will be much worse.
> >> Right now ports are totally independent entities (except some
> special cases like
> >> link-bonding, etc.).
> >> As one port can keep doing RX/TX, second one can be stopped, re-
> confgured,
> >> even detached, and newly attached device might re-use same port
> number.
> > I see this as a similar restriction to the one discussed above.
> 
> Yes, they are similar in principal, though I think that the case
> with queues from different port would make things much more complex.
> 
>  > Do you see any issues if we enforce this with checks?
> 
> Hard to tell straightway, a lot will depend how smart
> such implementation would be.
> Usually DPDK tends not to avoid heavy
> synchronizations within its data-path functions.

Certainly! Implementing more and more of such features in the PMDs will lead to 
longer and longer data plane code paths in the PMDs. It is the "salami method", 
where each small piece makes no performance difference, but they all add up, 
and eventually the sum of them does impact the performance of the general use 
case negatively.

> 
> >>
> >>
> >>>>
> >>>>>
> >>>>>> - very limited usage scenario - it will have a positive effect
> only
> >>>>>>      when we have a fixed forwarding mapping: all (or nearly
> all) packets
> >>>>>>      from the RX queue are forwarded into the same TX queue.
> >>>>>
> >>>>> [Feifei] Although the usage scenario is limited, this usage
> scenario
> >>>>> has a wide range of applications, such as NIC with one port.
> >>>>
> >>>> yes, there are NICs with one port, but no guarantee there wouldn't
> be
> >>>> several such NICs within the system.
> >>> What I see in my interactions is, a single NIC/DPU is under
> utilized for a 2
> >> socket system. Some are adding more sockets to the system to better
> utilize the
> >> DPU. The NIC bandwidth continues to grow significantly. I do not
> think there will
> >> be a multi-DPU per server scenario.
> >>
> >>
> >> Interesting... from my experience it is visa-versa:
> >> in many cases 200Gb/s is not that much these days to saturate modern
> 2 socket
> >> x86 server.
> >> Though I suppose a lot depends on particular HW and actual workload.
> >>
> >>>
> >>>>
> >>>>> Furtrhermore, I think this is a tradeoff between performance and
> >>>>> flexibility.
> >>>>> Our goal is to achieve best performance, this means we need to
> give
> >>>>> up some flexibility decisively. For example of 'FAST_FREE Mode',
> it
> >>>>> deletes most of the buffer check (refcnt > 1, external buffer,
> chain
> >>>>> buffer), chooses a shorest path, and then achieve significant
> >>>>> performance
> >>>> improvement.
> >>>>>> Wonder did you had a chance to consider mempool-cache ZC API,
> >>>>>> similar to one we have for the ring?
> >>>>>> It would allow us on TX free path to avoid copying mbufs to
> >>>>>> temporary array on the stack.
> >>>>>> Instead we can put them straight from TX SW ring to the mempool
> cache.
> >>>>>> That should save extra store/load for mbuf and might help to
> >>>>>> achieve some performance gain without by-passing mempool.
> >>>>>> It probably wouldn't be as fast as what you proposing, but might
> be
> >>>>>> fast enough to consider as alternative.
> >>>>>> Again, it would be a generic one, so we can avoid all these
> >>>>>> implications and limitations.
> >>>>>
> >>>>> [Feifei] I think this is a good try. However, the most important
> >>>>> thing is that if we can bypass the mempool decisively to pursue
> the
> >>>>> significant performance gains.
> >>>>
> >>>> I understand the intention, and I personally think this is wrong
> and
> >>>> dangerous attitude.
> >>>> We have mempool abstraction in place for very good reason.
> >>>> So we need to try to improve mempool performance (and API if
> >>>> necessary) at first place, not to avoid it and break our own rules
> and
> >> recommendations.
> >>> The abstraction can be thought of at a higher level. i.e. the
> driver manages the
> >> buffer allocation/free and is hidden from the application. The
> application does
> >> not need to be aware of how these changes are implemented.
> >>>
> >>>>
> >>>>
> >>>>> For ZC, there maybe a problem for it in i40e. The reason for that
> >>>>> put Tx buffers into temporary is that i40e_tx_entry includes
> buffer
> >>>>> pointer and index.
> >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> need
> >>>>> to firstlt extract mbuf pointer. Finally, though we use ZC, we
> still
> >>>>> can't avoid using a temporary stack to extract Tx buffer
> pointers.
> >>>>
> >>>> When talking about ZC API for mempool cache I meant something
> like:
> >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> >>>> uint32_t *nb_elem, uint32_t flags); void
> >>>> mempool_cache_put_zc_finish(struct
> >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
> >>>> user a pointer inside mp-cache where to put free elems and max
> number
> >>>> of slots that can be safely filled.
> >>>> _finish_ will update mc->len.
> >>>> As an example:
> >>>>
> >>>> /* expect to free N mbufs */
> >>>> uint32_t n = N;
> >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> >>>>
> >>>> /* free up to n elems */
> >>>> for (i = 0; i != n; i++) {
> >>>>
> >>>>      /* get next free mbuf from somewhere */
> >>>>      mb = extract_and_prefree_mbuf(...);
> >>>>
> >>>>      /* no more free mbufs for now */
> >>>>      if (mb == NULL)
> >>>>         break;
> >>>>
> >>>>      p[i] = mb;
> >>>> }
> >>>>
> >>>> /* finalize ZC put, with _i_ freed elems */
> >>>> mempool_cache_put_zc_finish(mc, i);
> >>>>
> >>>> That way, I think we can overcome the issue with i40e_tx_entry you
> >>>> mentioned above. Plus it might be useful in other similar places.
> >>>>
> >>>> Another alternative is obviously to split i40e_tx_entry into two
> >>>> structs (one for mbuf, second for its metadata) and have a
> separate
> >>>> array for each of them.
> >>>> Though with that approach we need to make sure no perf drops will
> be
> >>>> introduced, plus probably more code changes will be required.
> >>> Commit '5171b4ee6b6" already does this (in a different way), but
> just for
> >> AVX512. Unfortunately, it does not record any performance
> improvements. We
> >> could port this to Arm NEON and look at the performance.
> >
> 

Reply via email to