That's very interesting food for thought.
I hope we will have a good community discussion on this list
this week to make some decisions.


01/11/2020 10:12, Morten Brørup:
> > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > Sent: Saturday, October 31, 2020 9:41 PM
> > 
> > 31/10/2020 19:20, Morten Brørup:
> > > Thomas,
> > >
> > > Adding my thoughts to the already detailed feedback on this important
> > > patch...
> > >
> > > The first cache line is not inherently "hotter" than the second. The
> > > hotness depends on their usage.
> > >
> > > The mbuf cacheline1 marker has the following comment:
> > > /* second cache line - fields only used in slow path or on TX */
> > >
> > > In other words, the second cache line is intended not to be touched in
> > > fast path RX.
> > >
> > > I do not think this is true anymore. Not even with simple non-scattered
> > > RX. And regression testing probably didn't catch this, because the tests
> > > perform TX after RX, so the cache miss moved from TX to RX and became a
> > > cache hit in TX instead. (I may be wrong about this claim, but it's not
> > > important for the discussion.)
> > >
> > > I think the right question for this patch is: Can we achieve this - not
> > > using the second cache line for fast path RX - again by putting the right
> > > fields in the first cache line?
> > >
> > > Probably not in all cases, but perhaps for some...
> > >
> > > Consider the application scenarios.
> > >
> > > When a packet is received, one of three things happens to it:
> > > 1. It is immediately transmitted on one or more ports.
> > > 2. It is immediately discarded, e.g. by a firewall rule.
> > > 3. It is put in some sort of queue, e.g. a ring for the next pipeline
> > > stage, or in a QoS queue.
> > >
> > > 1. If the packet is immediately transmitted, the m->tx_offload field in
> > > the second cache line will be touched by the application and TX function
> > > anyway, so we don't need to optimize the mbuf layout for this scenario.
> > >
> > > 2. The second scenario touches m->pool no matter how it is implemented.
> > > The application can avoid touching m->next by using rte_mbuf_raw_free(),
> > > knowing that the mbuf came directly from RX and thus no other fields have
> > > been touched. In this scenario, we want m->pool in the first cache line.
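> > >
> > > As a sketch of what I mean (the firewall_drop() name is made up; it
> > > assumes the mbuf came straight from RX, so its refcount is 1 and it
> > > has no attached segments, as rte_mbuf_raw_free() requires):
> > >
> > > #include <rte_mbuf.h>
> > >
> > > static inline void
> > > firewall_drop(struct rte_mbuf *m)
> > > {
> > >     /* Unlike rte_pktmbuf_free(), this does not walk m->next;
> > >      * only m->pool is read before the mbuf is returned to
> > >      * its mempool. */
> > >     rte_mbuf_raw_free(m);
> > > }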
> > >
> > > 3. Now, let's consider the third scenario, where RX is followed by
> > > enqueue into a ring. If the application does nothing but put the packet
> > > into a ring, we don't need to move anything into the first cache line. But
> > > applications usually do more... So it is application specific what would
> > > be good to move to the first cache line:
> > >
> > > A. If the application does not use segmented mbufs, and performs analysis
> > > and preparation for transmission in the initial pipeline stages, and only
> > > the last pipeline stage performs TX, we could move m->tx_offload to the
> > > first cache line, which would keep the second cache line cold until the
> > > actual TX happens in the last pipeline stage - maybe even after the packet
> > > has waited in a QoS queue for a long time, and its cache lines have gone
> > > cold. (See the sketch after B below.)
> > >
> > > B. If the application uses segmented mbufs on RX, it might make sense to
> > > move m->next to the first cache line. (We don't use segmented mbufs, so
> > > I'm not sure about this.)
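> > >
> > > To illustrate A: an early pipeline stage could fill in the offload
> > > metadata while the packet headers are hot in cache. A sketch only,
> > > assuming m->tx_offload were moved to the first cache line; the
> > > stage_prepare_tx() name is made up:
> > >
> > > #include <rte_ether.h>
> > > #include <rte_mbuf.h>
> > >
> > > static inline void
> > > stage_prepare_tx(struct rte_mbuf *m, uint8_t l3_len)
> > > {
> > >     m->l2_len = sizeof(struct rte_ether_hdr);
> > >     m->l3_len = l3_len; /* from header parsing */
> > >     m->ol_flags |= PKT_TX_IP_CKSUM;
> > >     /* With m->tx_offload in the first cache line, the second
> > >      * line would stay cold until the final TX stage - even
> > >      * after a long wait in a QoS queue. */
> > > }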
> > >
> > >
> > > However, reality perhaps beats theory:
> > >
> > > Looking at the E1000 PMD, it seems like even its non-scattered RX
> > > function, eth_igb_recv_pkts(), sets m->next. If it only kept its own free
> > > pool pre-initialized instead... I haven't investigated other PMDs, except
> > > briefly looking at the mlx5 PMD, and it seems like it doesn't touch m->next
> > > in RX.
> > >
> > > I haven't looked deeper into how m->pool is being used by RX in PMDs, but
> > > I suppose that it isn't touched in RX.
> > >
> > > <rant on>
> > > If only we had a performance test where RX was not immediately followed
> > > by TX, but the packets were passed through a large queue in-between, so RX
> > > cache misses were not free of charge because they transform TX cache misses
> > > into cache hits instead...
> > > <rant off>
> > >
> > > Whatever you choose, I am sure that most applications will find it more
> > > useful than the timestamp. :-)
> > 
> > Thanks for the thoughts, Morten.
> > I believe we need benchmarks of different scenarios with different drivers.
> >
> 
> If we are only allowed to modify the mbuf structure this one more time, we 
> should look forward, not backwards!
> 
> If we move m->tx_offload to the first cache line, applications using simple, 
> non-scattered packet mbufs would never even need to touch the second cache 
> line, except for freeing the mbuf (which needs to read m->pool).
> 
> And this leads to my next suggestion...
> 
> One thing has always puzzled me: Why do we use 64 bits to indicate which 
> memory pool an mbuf belongs to? The port ID only uses 16 bits, serving as 
> an indirection index. Why don't we use the same kind of indirection index 
> for mbuf pools?
> 
> I can easily imagine using one mbuf pool (or perhaps a few pools) per CPU 
> socket (or per physical memory bus closest to an attached NIC), but not more 
> than 256 mbuf memory pools in total. So, let's introduce an mbufpoolid like 
> the portid, and cut this mbuf field down from 64 to 8 bits.
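> 
> Something like this, mirroring how rte_eth_devices[] maps port IDs to 
> devices (a sketch only; the table and function names are made up):
> 
> #include <stdint.h>
> #include <rte_mempool.h>
> 
> /* Global registry, filled in when a pktmbuf pool is created: */
> static struct rte_mempool *mbuf_pool_table[UINT8_MAX + 1];
> 
> static inline struct rte_mempool *
> rte_mbuf_pool_from_id(uint8_t pool_id)
> {
>     return mbuf_pool_table[pool_id];
> }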
> 
> If we also cut down m->pkt_len from 32 to 24 bits, we can get the 8 bit mbuf 
> pool index into the first cache line at no additional cost.
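> 
> I.e. something along these lines in the first cache line (a hypothetical 
> repacking, not today's rte_mbuf layout):
> 
> uint32_t pkt_len:24; /* total packet length; max 16 MB - 1  */
> uint32_t pool_id:8;  /* index into the mbuf pool registry   */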
> 
> In other words: This would free up another 64 bit field in the mbuf structure!
> 
> 
> And even though the m->next pointer for scattered packets resides in the 
> second cache line, the libraries and applications know that m->next is NULL 
> when m->nb_segs is 1. This means that my suggestion would make touching the 
> second cache line unnecessary (in simple cases), even for re-initializing the 
> mbuf.
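> 
> E.g. a free path along these lines (a sketch, using the hypothetical 
> pool_id field and registry from above, and assuming the refcount is 1):
> 
> static inline void
> pkt_free(struct rte_mbuf *m) /* illustrative name */
> {
>     if (m->nb_segs == 1) {
>         /* Single segment: neither m->next nor m->pool in the
>          * second cache line needs to be read. */
>         rte_mempool_put(rte_mbuf_pool_from_id(m->pool_id), m);
>     } else {
>         rte_pktmbuf_free(m);
>     }
> }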
> 
> 
> And now I will proceed out on a tangent with two more independent thoughts, 
> so feel free to ignore.
> 
> Consider a multi-socket system with one mbuf pool per CPU socket, where the 
> NICs attached to each CPU socket use an RX mbuf pool backed by RAM on the 
> same socket. I would imagine that (re-)initializing these mbufs could be 
> faster if performed only on a CPU on the same socket. If this is the case, 
> mbufs should be re-initialized as part of the RX preparation at ingress, not 
> as part of the mbuf free at egress.
> 
> Perhaps some microarchitectures are faster at comparing nb_segs==0 than 
> nb_segs==1. If so, nb_segs could be redefined to mean the number of 
> additional segments, rather than the total number of segments.
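> 
> I.e. (sketch):
> 
> /* With nb_segs redefined as "number of additional segments",
>  * the single-segment test compares against zero, which some
>  * ISAs encode more compactly (e.g. test/jz instead of a cmp
>  * with an immediate): */
> if (m->nb_segs == 0) /* exactly one segment */
>     rte_mbuf_raw_free(m);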


