> > I am talking about a different thing:
> > I think with some extra effort the driver can use (in some cases)
> > rte_mbuf_raw_free_bulk() even when RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > is not specified.
> > Let's say we make txq->fast_free_mp[] an array with the same size as
> > txq->txep[].
> > At tx_burst(), when filling txep[], we can do pre_free() checks for that
> > mbuf, and in case of success store its mempool pointer in the
> > corresponding txq->fast_free_mp[] entry, otherwise put NULL there.
> > Then at tx_free() we can scan fast_free_mp[] and invoke raw_free() for
> > non-NULL entries.
> > Again, for now it is just an idea, probably worth thinking about.
>
> Yes, that seems like an excellent idea, certainly worth considering!
>
> At tx_free(), the mbufs might be cold, so not accessing them at this point
> improves performance. (Which is also the point of my patch.)
Yes.
>
> At tx_burst(), the mbufs are read anyway (their information is written into
> the tx descriptors), so the mbufs are hot in the cache at this point.
Yes.
> Best case with your suggestion, rte_pktmbuf_prefree_seg() doesn't write the
> mbuf, so the performance cost of doing it at tx_burst() is extremely low.
Yes.
> Worst case with your suggestion, rte_pktmbuf_prefree_seg() does write the
> mbuf, so the mbuf write operation simply moves from tx_free() to tx_burst().
> However, in tx_burst(), the mbuf is already hot in the cache, so per
> transmitted mbuf, we get one load+store at tx_burst() instead of one load at
> tx_burst() + one load+store at tx_free().
I suppose you plan to invoke the full rte_pktmbuf_prefree_seg() here?
Unfortunately, I don't think that is possible: for cases when refcnt > 1, we
must decrement refcnt only when we are ready to release the mbuf. Otherwise
we can end up with NIC HW reading from an already released (and probably
re-used) mbuf.
What we probably need is a lightweight version of rte_pktmbuf_prefree_seg()
that returns a non-NULL value only when refcnt == 1 and the segment is a
direct mbuf (neither indirect nor with external memory attached).
Something like:
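static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_check(const struct rte_mbuf *m)
{
	/* Read-only: sole owner and direct (not indirect, no external buffer). */
	if (rte_mbuf_refcnt_read(m) == 1 && RTE_MBUF_DIRECT(m))
		return (struct rte_mbuf *)(uintptr_t)m; /* drop const for caller */

	return NULL;
}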
So in the worst case (when this check returns NULL) we still need to do a
load+store at tx_free().
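
To make the flow a bit more concrete, here is a rough, self-contained toy
model in plain C (no DPDK headers). All toy_* names are made up for
illustration and are not real DPDK types or APIs: toy_mbuf/toy_mempool stand
in for rte_mbuf/rte_mempool, the "direct" flag stands in for
RTE_MBUF_DIRECT(), and bumping a counter stands in for returning the mbuf to
its mempool. It also assumes that an mbuf seen with refcnt == 1 at tx_burst()
stays exclusively owned by the driver until tx_free(). Treat it as a sketch
of the idea, not as a proposed implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the DPDK types; NOT the real definitions. */
struct toy_mempool {
	unsigned int nb_returned;	/* mbufs handed back to this pool */
};

struct toy_mbuf {
	uint16_t refcnt;
	int direct;			/* stands in for RTE_MBUF_DIRECT(m) */
	struct toy_mempool *pool;
};

#define TOY_RING_SIZE 4

struct toy_txq {
	struct toy_mbuf *txep[TOY_RING_SIZE];
	/* Same size as txep[]; NULL means "take the slow path at free". */
	struct toy_mempool *fast_free_mp[TOY_RING_SIZE];
	unsigned int slow_frees;	/* mbufs dereferenced at free time */
};

/* Read-only check, mirroring the lightweight prefree check idea. */
static inline int
toy_prefree_check(const struct toy_mbuf *m)
{
	return m->refcnt == 1 && m->direct;
}

/* tx_burst() side: the mbuf is hot here, so check it and cache its pool. */
static void
toy_tx_enqueue(struct toy_txq *txq, unsigned int idx, struct toy_mbuf *m)
{
	assert(idx < TOY_RING_SIZE);
	txq->txep[idx] = m;
	txq->fast_free_mp[idx] = toy_prefree_check(m) ? m->pool : NULL;
}

/* tx_free() side: only the slow path touches the (possibly cold) mbuf. */
static void
toy_tx_free(struct toy_txq *txq)
{
	unsigned int i;

	for (i = 0; i < TOY_RING_SIZE; i++) {
		struct toy_mbuf *m = txq->txep[i];

		if (m == NULL)
			continue;
		if (txq->fast_free_mp[i] != NULL) {
			/* Fast path: raw free, no mbuf dereference. */
			txq->fast_free_mp[i]->nb_returned++;
		} else {
			/* Slow path: simplified prefree, load+store on mbuf. */
			txq->slow_frees++;
			if (--m->refcnt == 0)
				m->pool->nb_returned++;
		}
		txq->txep[i] = NULL;
		txq->fast_free_mp[i] = NULL;
	}
}
```

In a real driver the fast-path entries would presumably be batched per
mempool and passed to rte_mbuf_raw_free_bulk() instead of being handled one
by one, and the slow path would call the full rte_pktmbuf_prefree_seg().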