That's very interesting food for thought. I hope we will have a good community discussion on this list this week to make some decisions.
01/11/2020 10:12, Morten Brørup:
> > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > Sent: Saturday, October 31, 2020 9:41 PM
> >
> > 31/10/2020 19:20, Morten Brørup:
> > > Thomas,
> > >
> > > Adding my thoughts to the already detailed feedback on this important patch...
> > >
> > > The first cache line is not inherently "hotter" than the second. The hotness depends on their usage.
> > >
> > > The mbuf cacheline1 marker has the following comment:
> > > /* second cache line - fields only used in slow path or on TX */
> > >
> > > In other words, the second cache line is intended not to be touched in fast path RX.
> > >
> > > I do not think this is true anymore. Not even with simple non-scattered RX. And regression testing probably didn't catch this, because the tests perform TX after RX, so the cache miss moved from TX to RX and became a cache hit in TX instead. (I may be wrong about this claim, but it's not important for the discussion.)
> > >
> > > I think the right question for this patch is: Can we achieve this - not using the second cache line for fast path RX - again by putting the right fields in the first cache line?
> > >
> > > Probably not in all cases, but perhaps for some...
> > >
> > > Consider the application scenarios.
> > >
> > > When a packet is received, one of three things happens to it:
> > > 1. It is immediately transmitted on one or more ports.
> > > 2. It is immediately discarded, e.g. by a firewall rule.
> > > 3. It is put in some sort of queue, e.g. a ring for the next pipeline stage, or in a QoS queue.
> > >
> > > 1. If the packet is immediately transmitted, the m->tx_offload field in the second cache line will be touched by the application and the TX function anyway, so we don't need to optimize the mbuf layout for this scenario.
> > >
> > > 2. The second scenario touches m->pool no matter how it is implemented. The application can avoid touching m->next by using rte_mbuf_raw_free(), knowing that the mbuf came directly from RX and thus no other fields have been touched. In this scenario, we want m->pool in the first cache line.
> > >
> > > 3. Now, let's consider the third scenario, where RX is followed by enqueue into a ring. If the application does nothing but put the packet into a ring, we don't need to move anything into the first cache line. But applications usually do more... So it is application specific what would be good to move to the first cache line:
> > >
> > > A. If the application does not use segmented mbufs, and performs analysis and preparation for transmission in the initial pipeline stages, and only the last pipeline stage performs TX, we could move m->tx_offload to the first cache line. This would keep the second cache line cold until the actual TX happens in the last pipeline stage - maybe even after the packet has waited in a QoS queue for a long time, and its cache lines have gone cold.
> > >
> > > B. If the application uses segmented mbufs on RX, it might make sense to move m->next to the first cache line. (We don't use segmented mbufs, so I'm not sure about this.)
> > >
> > >
> > > However, reality perhaps beats theory:
> > >
> > > Looking at the E1000 PMD, it seems like even its non-scattered RX function, eth_igb_recv_pkts(), sets m->next. If it only kept its own free pool pre-initialized instead...
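To make scenario 2 above concrete, here is a minimal sketch of such a drop path. It assumes the mbufs come straight from RX (single segment, refcnt == 1, direct mbufs), which is exactly what makes rte_mbuf_raw_free() legal here; packet_is_blocked() is a hypothetical classification step, not an existing API:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical filter; stands in for a real firewall rule lookup. */
static inline int
packet_is_blocked(const struct rte_mbuf *m)
{
	(void)m;
	return 0;
}

/* Receive a burst and immediately discard blocked packets.
 * rte_mbuf_raw_free() returns the mbuf to its pool without walking
 * m->next, so the only second-cache-line field it reads is m->pool. */
static uint16_t
rx_and_filter(uint16_t port_id, uint16_t queue_id,
	      struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, nb_pkts);
	uint16_t nb_kept = 0;
	uint16_t i;

	for (i = 0; i < nb_rx; i++) {
		if (packet_is_blocked(pkts[i]))
			rte_mbuf_raw_free(pkts[i]); /* m->next untouched */
		else
			pkts[nb_kept++] = pkts[i];
	}
	return nb_kept;
}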
> > > I haven't investigated other PMDs, except briefly looking at the mlx5 PMD, and it seems like it doesn't touch m->next in RX.
> > >
> > > I haven't looked deeper into how m->pool is being used by RX in PMDs, but I suppose that it isn't touched in RX.
> > >
> > > <rant on>
> > > If only we had a performance test where RX was not immediately followed by TX, but the packets were passed through a large queue in-between, so RX cache misses were not effectively free of charge (by merely transforming TX cache misses into TX cache hits)...
> > > <rant off>
> > >
> > > Whatever you choose, I am sure that most applications will find it more useful than the timestamp. :-)
> >
> > Thanks for the thoughts Morten.
> > I believe we need benchmarks of different scenarios with different drivers.
>
> If we are only allowed to modify the mbuf structure this one more time, we should look forward, not backwards!
>
> If we move m->tx_offload to the first cache line, applications using simple, non-scattered packet mbufs would never even need to touch the second cache line, except for freeing the mbuf (which needs to read m->pool).
>
> And this leads to my next suggestion...
>
> One thing has always puzzled me: Why do we use 64 bits to indicate which memory pool an mbuf belongs to? The portid only uses 16 bits and an indirection index. Why don't we use the same kind of indirection index for mbuf pools?
>
> I can easily imagine using one mbuf pool (or perhaps a few pools) per CPU socket (or per physical memory bus closest to an attached NIC), but not more than 256 mbuf memory pools in total. So, let's introduce an mbufpoolid like the portid, and cut this mbuf field down from 64 to 8 bits.
>
> If we also cut down m->pkt_len from 32 to 24 bits, we can get the 8 bit mbuf pool index into the first cache line at no additional cost.
>
> In other words: This would free up another 64 bit field in the mbuf structure!
>
> And even though the m->next pointer for scattered packets resides in the second cache line, the libraries and applications know that m->next is NULL when m->nb_segs is 1. This means that my suggestion would make touching the second cache line unnecessary (in simple cases), even for re-initializing the mbuf.
>
> And now I will proceed out on a tangent with two more independent thoughts, so feel free to ignore.
>
> Consider a multi CPU socket system with one mbuf pool per CPU socket, where the NICs attached to each CPU socket use an RX mbuf pool with RAM on the same CPU socket. I would imagine that (re-)initializing these mbufs could be faster if performed only on a CPU on the same socket. If this is the case, mbufs should be re-initialized as part of the RX preparation at ingress, not as part of the mbuf free at egress.
>
> Perhaps some microarchitectures are faster at comparing nb_segs==0 than nb_segs==1. If so, nb_segs could be redefined to mean the number of additional segments, rather than the total number of segments.
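On the benchmarking rant above: such a test is straightforward to sketch with an rte_ring sized well beyond the last-level cache, so the mbufs' cache lines have gone cold by the time the TX stage dequeues them. A rough outline, with arbitrary burst and ring sizes (the ring would be created elsewhere, e.g. rte_ring_create("rx2tx", 1 << 16, ...)):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

/* Stage 1 (core A): RX into a large ring, no TX here. */
static void
rx_stage(uint16_t port_id, struct rte_ring *q)
{
	struct rte_mbuf *pkts[32];
	uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);
	unsigned int nb_enq =
		rte_ring_enqueue_burst(q, (void **)pkts, nb_rx, NULL);

	while (nb_enq < nb_rx) /* drop what didn't fit */
		rte_pktmbuf_free(pkts[nb_enq++]);
}

/* Stage 2 (core B): TX from the ring, long after the RX happened,
 * so any RX cache miss can no longer masquerade as a TX cache hit. */
static void
tx_stage(uint16_t port_id, struct rte_ring *q)
{
	struct rte_mbuf *pkts[32];
	unsigned int nb_deq =
		rte_ring_dequeue_burst(q, (void **)pkts, 32, NULL);
	uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_deq);

	while (nb_tx < nb_deq)
		rte_pktmbuf_free(pkts[nb_tx++]);
}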
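On the pool indirection idea, purely as an illustration: this is not the actual rte_mbuf layout, and all names below are hypothetical. The suggested split of the existing 32-bit pkt_len into 24 + 8 bits could look like this, with the pool resolved through a small table much as a portid resolves through rte_eth_devices[]:

#include <stdint.h>
#include <rte_mempool.h>

/* Hypothetical: at most 256 mbuf pools, registered at pool creation. */
extern struct rte_mempool *mbuf_pool_table[256];

/* Hypothetical first-cache-line word: pkt_len shrinks to 24 bits
 * (max ~16 MB per packet), freeing 8 bits for the pool index and
 * making the 64-bit m->pool pointer redundant. */
struct mbuf_len_and_pool {
	uint32_t pkt_len  : 24;
	uint32_t pool_idx : 8;
};

/* Free path resolves the pool via the index instead of m->pool. */
static inline struct rte_mempool *
mbuf_pool_from_idx(uint8_t pool_idx)
{
	return mbuf_pool_table[pool_idx];
}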
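And on the nb_segs observation: a free routine can stay out of the second cache line for the common single-segment case by branching on nb_segs (first cache line) before ever reading m->next. A minimal sketch, assuming direct mbufs with refcnt == 1 as rte_mbuf_raw_free() requires; under Morten's proposed redefinition the test would simply become nb_segs == 0:

#include <rte_branch_prediction.h>
#include <rte_mbuf.h>

static inline void
free_simple_or_chained(struct rte_mbuf *m)
{
	if (likely(m->nb_segs == 1))
		/* m->next is known to be NULL; only m->pool is read
		 * from the second cache line. */
		rte_mbuf_raw_free(m);
	else
		/* Slow path: walk and free the whole segment chain. */
		rte_pktmbuf_free(m);
}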