Executive Summary:
My analysis shows that the mbuf library is not a barrier to fast-freeing
segmented packet mbufs, and thus fast-free of jumbo frames is possible.
Detailed Analysis:
The purpose of the mbuf fast-free Tx optimization is to reduce
rte_pktmbuf_free_seg() to something much simpler in the ethdev drivers, by
eliminating the code path related to indirect mbufs.
Ideally, we want to simplify the ethdev driver's function that frees the
transmitted mbufs, so it can free them directly to their mempool without
accessing the mbufs themselves.
If the driver cannot access the mbuf itself, it cannot determine which
mempool it belongs to.
We don't want the driver to access every mbuf being freed; but if all
mbufs of a Tx queue belong to the same mempool, the driver can determine
which mempool by looking into just one of the mbufs.
REQUIREMENT 1: The mbufs of a Tx queue must come from the same mempool.
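As a sketch of what this requirement enables (the function name is
hypothetical, but rte_mempool_put_bulk() is the real mempool API), a
driver's completion path could look like this:

#include <rte_mbuf.h>

/* Hypothetical completion path for a Tx queue configured with fast-free.
 * Because REQUIREMENT 1 guarantees a single mempool, the pool pointer is
 * read from the first mbuf only, and the whole batch is returned in bulk.
 */
static inline void
txq_fast_free_bulk(struct rte_mbuf **mbufs, uint16_t count)
{
    if (count == 0)
        return;
    /* All mbufs share one mempool; look into just one of them. */
    rte_mempool_put_bulk(mbufs[0]->pool, (void **)mbufs, count);
}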
When an mbuf is freed back to its mempool, some of its fields must first
be reinitialized (refcnt=1, next=NULL, nb_segs=1).
So, for fast-free, this must be done by the driver's function that
prepares the Tx descriptor.
This is a requirement on the driver, not on the application.
Now, let's dig into the code for freeing an mbuf.
Note: For readability purposes, I'll cut out some code and comments
unrelated to this topic.
static __rte_always_inline void
rte_pktmbuf_free_seg(struct rte_mbuf *m)
{
    m = rte_pktmbuf_prefree_seg(m);
    if (likely(m != NULL))
        rte_mbuf_raw_free(m);
}
rte_mbuf_raw_free(m) is simple, so nothing to gain there:
/**
 * Put mbuf back into its original mempool.
 *
 * The caller must ensure that the mbuf is direct and properly
 * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
 * rte_pktmbuf_prefree_seg().
 */
static __rte_always_inline void
rte_mbuf_raw_free(struct rte_mbuf *m)
{
    rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
    rte_mempool_put(m->pool, m);
}
Note that the description says that the mbuf must be direct.
This is not entirely accurate; the mbuf is allowed to use a pinned
external buffer, if the mbuf holds the only reference to it.
(Most of the mbuf library functions have this documentation inaccuracy,
which should be fixed some day.)
So, the fast-free optimization really comes down to
rte_pktmbuf_prefree_seg(m), which must not return NULL.
Let's dig into that.
/**
 * Decrease reference counter and unlink a mbuf segment
 *
 * This function does the same as a free, except that it does not
 * return the segment to its pool.
 * It decreases the reference counter, and if it reaches 0, the
 * segment is detached from its parent if it is an indirect mbuf.
 *
 * @return
 *   - (m) if it is the last reference. It can be recycled or freed.
 *   - (NULL) if the mbuf still has remaining references on it.
 */
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
    bool refcnt_not_one;

    refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
    if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
        return NULL;

    if (unlikely(!RTE_MBUF_DIRECT(m))) {
        rte_pktmbuf_detach(m);
        if (RTE_MBUF_HAS_EXTBUF(m) &&
            RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
            __rte_pktmbuf_pinned_extbuf_decref(m))
            return NULL;
    }

    if (refcnt_not_one)
        rte_mbuf_refcnt_set(m, 1);
    if (m->nb_segs != 1)
        m->nb_segs = 1;
    if (m->next != NULL)
        m->next = NULL;

    return m;
}
This function can only succeed (i.e. return non-NULL) when 'refcnt' is 1,
or when the decrement brings it to 0.
REQUIREMENT 2: The driver must hold the only reference to the mbuf,
i.e. 'm->refcnt' must be 1.
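For illustration, here are two application-side operations that violate
this requirement; the helper function and 'indirect_mp' are hypothetical,
but rte_mbuf_refcnt_update() and rte_pktmbuf_clone() are the real library
functions:

#include <rte_mbuf.h>

/* Illustrative only: two ways an application can leave 'm' with
 * refcnt > 1, making it ineligible for fast-free. 'indirect_mp' is a
 * hypothetical mempool holding the indirect clone mbufs.
 */
static void
break_fast_free_eligibility(struct rte_mbuf *m, struct rte_mempool *indirect_mp)
{
    rte_mbuf_refcnt_update(m, 1);   /* hold an extra reference to 'm' */
    struct rte_mbuf *clone = rte_pktmbuf_clone(m, indirect_mp);
    /* Also, 'clone' itself is indirect and thus violates REQUIREMENT 3. */
    (void)clone;
}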
When the function succeeds, it initializes the mbuf fields as required by
rte_mbuf_raw_free() before returning.
Now, since the driver has exclusive access to the mbuf, it is free to
initialize 'm->next' and 'm->nb_segs' at any time.
It could do that when preparing the Tx descriptor.
This is very interesting, because it means that fast-free does not
prohibit segmented packets!
(But the driver must have sufficient Tx descriptors for all segments in
the mbuf.)
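A minimal sketch of such a Tx-prepare loop, with hypothetical names and
the actual descriptor filling elided:

#include <rte_mbuf.h>

/* Hypothetical Tx-prepare loop: one descriptor per segment; the chain
 * fields are reset while the driver touches each segment anyway, so the
 * completion path can free the mbufs without reading them again.
 */
static inline void
txq_enqueue_segs(struct rte_mbuf *pkt)
{
    struct rte_mbuf *m = pkt;

    while (m != NULL) {
        struct rte_mbuf *next = m->next;

        /* ... fill one Tx descriptor from rte_mbuf_data_iova(m)
         * and m->data_len ... */

        /* Pre-initialize the fields that rte_mbuf_raw_free()
         * requires; 'refcnt' is already 1 per REQUIREMENT 2. */
        m->next = NULL;
        m->nb_segs = 1;
        m = next;
    }
}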
Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling non-direct
mbufs, i.e. cloned mbufs and mbufs with an external buffer:
    if (unlikely(!RTE_MBUF_DIRECT(m))) {
        rte_pktmbuf_detach(m);
        if (RTE_MBUF_HAS_EXTBUF(m) &&
            RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
            __rte_pktmbuf_pinned_extbuf_decref(m))
            return NULL;
    }
Starting with rte_pktmbuf_detach():
static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
{
    struct rte_mempool *mp = m->pool;
    uint32_t mbuf_size, buf_len;
    uint16_t priv_size;

    if (RTE_MBUF_HAS_EXTBUF(m)) {
        /*
         * The mbuf has the external attached buffer,
         * we should check the type of the memory pool where
         * the mbuf was allocated from to detect the pinned
         * external buffer.
         */
        uint32_t flags = rte_pktmbuf_priv_flags(mp);

        if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
            /*
             * The pinned external buffer should not be
             * detached from its backing mbuf, just exit.
             */
            return;
        }
        __rte_pktmbuf_free_extbuf(m);
    } else {
        __rte_pktmbuf_free_direct(m);
    }

    priv_size = rte_pktmbuf_priv_size(mp);
    mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) + priv_size);
    buf_len = rte_pktmbuf_data_room_size(mp);

    m->priv_size = priv_size;
    m->buf_addr = (char *)m + mbuf_size;
    rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
    m->buf_len = (uint16_t)buf_len;
    rte_pktmbuf_reset_headroom(m);
    m->data_len = 0;
    m->ol_flags = 0;
}
The only quick and simple code path through this function is when the mbuf
uses a pinned external buffer:
    if (RTE_MBUF_HAS_EXTBUF(m)) {
        uint32_t flags = rte_pktmbuf_priv_flags(mp);

        if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
            return;
REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned external buffer.
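If it helps, an application can verify this property of a mempool at
setup time; rte_pktmbuf_priv_flags() and the flag are the real library
API, the helper function is just a sketch:

#include <stdio.h>
#include <rte_mbuf.h>

/* Sketch: check whether a pool was created with pinned external buffers
 * (e.g. by rte_pktmbuf_pool_create_extbuf()), so its mbufs take the
 * quick path through rte_pktmbuf_detach().
 */
static void
report_pinned_extbuf(struct rte_mempool *mp)
{
    if (rte_pktmbuf_priv_flags(mp) & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
        printf("pool %s: mbufs use pinned external buffers\n", mp->name);
}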
Continuing with the next part of rte_pktmbuf_prefree_seg()'s block:
        if (RTE_MBUF_HAS_EXTBUF(m) &&
            RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
            __rte_pktmbuf_pinned_extbuf_decref(m))
            return NULL;
Let's dig into __rte_pktmbuf_pinned_extbuf_decref():
/**
 * @internal Handle the packet mbufs with attached pinned external buffer
 * on the mbuf freeing:
 *
 *  - return zero if reference counter in shinfo is one. It means there is
 *    no more reference to this pinned buffer and mbuf can be returned to
 *    the pool
 *
 *  - otherwise (if reference counter is not one), decrement reference
 *    counter and return non-zero value to prevent freeing the backing mbuf.
 *
 * Returns non zero if mbuf should not be freed.
 */
static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
{
    struct rte_mbuf_ext_shared_info *shinfo;

    /* Clear flags, mbuf is being freed. */
    m->ol_flags = RTE_MBUF_F_EXTERNAL;
    shinfo = m->shinfo;

    /* Optimize for performance - do not dec/reinit */
    if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
        return 0;

    /*
     * Direct usage of add primitive to avoid
     * duplication of comparing with one.
     */
    if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -1,
            rte_memory_order_acq_rel) - 1))
        return 1;

    /* Reinitialize counter before mbuf freeing. */
    rte_mbuf_ext_refcnt_set(shinfo, 1);
    return 0;
}
Essentially, if the mbuf uses a pinned external buffer,
rte_pktmbuf_prefree_seg() only succeeds if the mbuf holds the only
reference to that buffer.
REQUIREMENT 4: If the mbuf uses a pinned external buffer, the mbuf must
hold the only reference to that pinned external buffer, i.e. in that case,
'm->shinfo->refcnt' must be 1.
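For completeness, here is roughly how an application opts in to fast-free
on the Tx side (a sketch; the helper name is made up, the ethdev APIs and
the offload flag are real):

#include <rte_ethdev.h>

/* Sketch: enable mbuf fast-free on a port that advertises the
 * capability; 'port_id' is assumed to be a valid, not-yet-configured
 * port. REQUIREMENTs 1-4 then describe what the application must
 * guarantee about the mbufs it transmits on that port's queues.
 */
static int
configure_with_fast_free(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_conf port_conf = {0};

    int ret = rte_eth_dev_info_get(port_id, &dev_info);
    if (ret != 0)
        return ret;
    if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
        port_conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
    return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &port_conf);
}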
Please review.
If I'm not mistaken, the mbuf library is not a barrier to fast-freeing
segmented packet mbufs, and thus fast-free of jumbo frames is possible.
We need a driver developer to confirm that my suggested approach -
resetting the mbuf fields, including 'm->nb_segs' and 'm->next', when
preparing the Tx descriptor - is viable.