> From: Bruce Richardson [mailto:[email protected]]
> Sent: Wednesday, 14 January 2026 17.36
>
> On Wed, Jan 14, 2026 at 04:31:31PM +0100, Morten Brørup wrote:
> > > > If I'm not mistaken, the mbuf library is not a barrier for fast-freeing
> > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > > possible.
> > > >
> > > > We need a driver developer to confirm that my suggested approach -
> > > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > > preparing the Tx descriptor - is viable.
> > > >
> > > Excellent analysis, Morten. If I get a chance some time this release
> > > cycle, I will try implementing this change in our drivers to see if it
> > > makes any difference.
> >
> > Bruce,
> >
> > Have you had a chance to look into the driver change requirements?
> > If not, could you please try scratching the surface, to build a gut
> > feeling.
>
> I'll try to take a look this week. I'm juggling a few things at the
> moment, so I had forgotten about this. Sorry.
>
> More comments inline below.
>
> /Bruce
>
> >
> > I wonder if the vector implementations have strong requirements that
> > packets are not segmented...
> >
> > The i40e driver only sets the "tx_simple_allowed" and "tx_vec_allowed"
> > flags when MBUF_FAST_FREE is set:
> > https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3502
> >
>
> Actually, it allows but does not require FAST_FREE. The check just
> verifies that the flags, with everything *but* FAST_FREE masked out, are
> the same as the original flags, i.e. FAST_FREE is just ignored.
That's not how I read the code:
    ad->tx_simple_allowed =
        (txq->offloads ==
         (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) &&
         txq->tx_rs_thresh >= I40E_TX_MAX_BURST);
Look at it with offloads = (MULTI_SEGS|FAST_FREE):
    simple_allowed = ((MULTI_SEGS|FAST_FREE) == ((MULTI_SEGS|FAST_FREE) & FAST_FREE))
i.e.:
    simple_allowed = ((MULTI_SEGS|FAST_FREE) == FAST_FREE)
i.e.: false, so MULTI_SEGS disqualifies the simple Tx path, with or without FAST_FREE.
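For illustration, a minimal C sketch of the same check, compilable outside DPDK. The names EX_TX_OFFLOAD_MULTI_SEGS, EX_TX_OFFLOAD_MBUF_FAST_FREE and EX_TX_MAX_BURST are hypothetical stand-ins; their values are assumptions meant to mirror RTE_ETH_TX_OFFLOAD_MULTI_SEGS, RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE and I40E_TX_MAX_BURST, and a real driver would take them from the DPDK headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit positions assumed to match rte_ethdev.h. */
#define EX_TX_OFFLOAD_MULTI_SEGS     (UINT64_C(1) << 15)
#define EX_TX_OFFLOAD_MBUF_FAST_FREE (UINT64_C(1) << 16)
/* Assumed value, standing in for I40E_TX_MAX_BURST. */
#define EX_TX_MAX_BURST 32

/* Same shape as the i40e condition: true only if no offload bit other
 * than FAST_FREE is set and the RS threshold is large enough.
 * The parentheses around the '&' term matter: '==' binds tighter than '&'. */
static bool simple_tx_allowed(uint64_t offloads, uint16_t tx_rs_thresh)
{
    return offloads == (offloads & EX_TX_OFFLOAD_MBUF_FAST_FREE) &&
           tx_rs_thresh >= EX_TX_MAX_BURST;
}
```

With this shape, offloads of 0 or FAST_FREE alone pass, while any other bit, MULTI_SEGS included, fails the check.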
>
> > And only when these two flags are set does it use a vector Tx function:
> > https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3550
> > And a special Tx Prep function:
> > https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3584
> > Which fails if nb_segs != 1:
> > https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L1675
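A rough sketch of the kind of per-packet check such a prep function performs, assuming a hypothetical cut-down mbuf type (struct ex_mbuf) and an err out-parameter in place of rte_errno; the linked i40e function may check more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct rte_mbuf. */
struct ex_mbuf {
    struct ex_mbuf *next;
    uint16_t nb_segs;
};

/* Checks a burst and returns how many leading packets are acceptable.
 * On the first multi-segment packet it stores EINVAL in *err and returns
 * that packet's index (the "number of packets prepared" convention of
 * rte_eth_tx_prepare(), with *err standing in for rte_errno). */
static uint16_t ex_simple_prep_pkts(struct ex_mbuf **pkts, uint16_t nb_pkts,
                                    int *err)
{
    assert(err != NULL);
    for (uint16_t i = 0; i < nb_pkts; i++) {
        if (pkts[i]->nb_segs != 1) {
            *err = EINVAL;
            return i;
        }
    }
    *err = 0;
    return nb_pkts;
}
```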
> >
> > So currently it does.
> > But does it need to?... That is the question.
> > Paraphrasing:
> > Can the Tx function only be vectorized when the code path doesn't
> > have branches depending on the number of segments?
> > If so, then this may be the main reason for not supporting segmented
> > packets with FAST_FREE.
> >
> > In that case, we cannot remove the single-segment requirement from
> > FAST_FREE without sacrificing the performance boost from vectorizing.
>
> No, based on what I stated above, this should not be a blocker. The
> vector paths do require us to guarantee only one segment per packet -
> without additional context descriptors - so only one descriptor per
> packet (in general; in one code path, always one data descriptor plus
> one context descriptor). FAST_FREE can be used in conjunction with that
> but should not be a requirement. See [1], where in vector cleanup we
> explicitly check for FAST_FREE.
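To make the FAST_FREE branch in cleanup concrete, a simplified sketch. It assumes hypothetical types (struct ex_mbuf, struct ex_mempool) and helper names rather than the real rte_mbuf/rte_mempool API, and relies on the FAST_FREE contract that all mbufs on the queue come from one mempool and have refcnt == 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_POOL_CAP 64

struct ex_mempool;

/* Hypothetical, cut-down stand-in for struct rte_mbuf. */
struct ex_mbuf {
    struct ex_mempool *pool;
    uint16_t refcnt;
};

/* Hypothetical stand-in for a mempool: a simple LIFO of free mbufs. */
struct ex_mempool {
    struct ex_mbuf *objs[EX_POOL_CAP];
    unsigned int count;
};

static void ex_pool_put_bulk(struct ex_mempool *mp, struct ex_mbuf **objs,
                             unsigned int n)
{
    assert(mp->count + n <= EX_POOL_CAP);
    for (unsigned int i = 0; i < n; i++)
        mp->objs[mp->count++] = objs[i];
}

/* Generic path: drop one reference; return the mbuf to its own pool
 * only when the last reference is gone. */
static void ex_free_one(struct ex_mbuf *m)
{
    if (--m->refcnt == 0)
        ex_pool_put_bulk(m->pool, &m, 1);
}

/* Frees 'n' completed Tx mbufs. */
static void ex_tx_free_bufs(struct ex_mbuf **done, unsigned int n,
                            bool fast_free)
{
    if (n == 0)
        return;
    if (fast_free) {
        /* FAST_FREE contract: same pool and refcnt == 1, so skip the
         * per-mbuf checks and return the whole batch in one call. */
        ex_pool_put_bulk(done[0]->pool, done, n);
        return;
    }
    for (unsigned int i = 0; i < n; i++)
        ex_free_one(done[i]);
}
```

The fast branch does no per-mbuf work at all, which is what makes it a good fit for a vectorized cleanup loop.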
>
> Similarly for the scalar code path, in my latest rework I am attempting
> to standardize the use of FAST_FREE optimizations, even when we have a
> slightly slower Tx path [2].
Good point:
The Tx path has two steps:
1) Pre-transmission Tx descriptor setup.
2) Post-transmission mbuf free.
The FAST_FREE requirements for optimizing each of these two steps may differ.
As suggested in my other email, hopefully the post-transmission step can be
vectorized (also for multi-segment packets) with help from the
pre-transmission step, i.e. by preparing the FAST_FREE segments for direct
release to the mempool.
Then we can consider the single-segment requirements for the
pre-transmission step separately.
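A minimal sketch of what that pre-transmission assistance could look like. Assumptions: struct ex_mbuf and ex_prep_segments_for_fast_free() are hypothetical, the descriptor write itself is elided, and whether it is safe to modify 'next' and 'nb_segs' at this point is exactly what still needs a driver developer's confirmation. Each segment is stored in its own software-ring slot and detached from the chain, so the completion step sees only standalone single-segment mbufs that can be bulk-released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct rte_mbuf. */
struct ex_mbuf {
    struct ex_mbuf *next;
    uint16_t nb_segs;
};

/* Walks one (possibly segmented) packet, placing each segment in its own
 * software-ring slot starting at 'idx' (wrapping at 'ring_size'), and
 * detaches every segment so it looks like a single-segment mbuf.
 * Returns the number of slots (one per segment) consumed. */
static uint16_t ex_prep_segments_for_fast_free(struct ex_mbuf *pkt,
                                               struct ex_mbuf **sw_ring,
                                               uint16_t idx,
                                               uint16_t ring_size)
{
    uint16_t used = 0;

    assert(ring_size != 0);
    for (struct ex_mbuf *seg = pkt; seg != NULL; ) {
        struct ex_mbuf *nxt = seg->next; /* read before clearing the link */

        /* ... the Tx data descriptor for 'seg' would be written here ... */

        seg->next = NULL;
        seg->nb_segs = 1;
        sw_ring[(idx + used) % ring_size] = seg;
        used++;
        seg = nxt;
    }
    return used;
}
```

One design consequence: the completion step no longer needs to walk chains, at the cost of one extra write to the header of each segment, which the Tx path has to read anyway to walk the chain.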
>
> [1] https://github.com/DPDK/dpdk/blob/main/drivers/net/intel/common/tx.h
> [2] https://patches.dpdk.org/project/dpdk/patch/20260113151505.1871271-[email protected]/
>
> >
> > But then we can pursue alternative optimizations instead, as suggested
> > by Konstantin.
> >
> > Here's another idea:
> > The Tx function could pre-scan each Tx burst for multi-segment packets,
> > to decide if the burst should be processed by the vector code path or a
> > fallback code path (which can also handle multi-segment packets).
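A minimal sketch of such a pre-scan, assuming a hypothetical cut-down mbuf type and path enum (struct ex_mbuf, enum ex_tx_path and the function names are all illustrative). It measures the leading run of single-segment packets and picks the vector path only when that run covers the whole burst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct rte_mbuf. */
struct ex_mbuf {
    struct ex_mbuf *next;
    uint16_t nb_segs;
};

enum ex_tx_path { EX_TX_PATH_VECTOR, EX_TX_PATH_SCALAR };

/* Number of leading packets in the burst that are single-segment. */
static uint16_t ex_single_seg_prefix(struct ex_mbuf *const *pkts,
                                     uint16_t nb_pkts)
{
    uint16_t i = 0;

    while (i < nb_pkts && pkts[i]->nb_segs == 1)
        i++;
    return i;
}

/* Vector path only if every packet in the burst is single-segment;
 * otherwise fall back to the path that handles segmented packets. */
static enum ex_tx_path ex_select_tx_path(struct ex_mbuf *const *pkts,
                                         uint16_t nb_pkts)
{
    return ex_single_seg_prefix(pkts, nb_pkts) == nb_pkts ?
           EX_TX_PATH_VECTOR : EX_TX_PATH_SCALAR;
}
```

A variant could use the prefix length to send the single-segment head of the burst on the vector path and only the remainder on the fallback path; whether the extra scan pays off would need measuring.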
> >
> >
> > -Morten
> >