> -----Original Message-----
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of jigsaw
> Sent: Thursday, April 13, 2017 9:59 PM
> To: Adrien Mazarguil <adrien.mazarg...@6wind.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
>
> Hi Adrien,
>
> Thanks for your comment.
>
> LOCK/UNLOCK may be called by the user application only. There are several
> reasons.
>
> 1. If the lib calls LOCK, the user application may be punished
> unexpectedly. Consider what happens if the Rx burst function calls LOCK on
> core #1, and the mbuf is then handed over from core #1 to core #2 through
> an enqueue/dequeue operation, as below:
>
> Rx(1) --> Enqueue(1) --> Dequeue(2)
> LOCKED                   Panic!
>
> Core #2 will then panic, because the mbuf is owned by core #1 without
> having been released.
>
> 2. Rx and Tx both usually work in burst mode, combined with prefetch
> operations. LOCK and UNLOCK cannot work well with prefetch, because they
> require data access to the mbuf header.
>
> 3. Critical sections tend to be small. Here we (the user application) need
> to find a balance: a longer critical section gives us more secured code,
> but a longer duration leads to more complicated business logic.
>
> Hence, we urge the user application to use LOCK/UNLOCK with the common
> sense of critical sections.
>
> Take examples/l3fwd as an example.
A proper LOCK/UNLOCK pair is shown below:
>
> ==========================================
> diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
> index 7714a20..9db0190 100644
> --- a/examples/l3fwd/l3fwd_em_hlm_sse.h
> +++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
> @@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>
>  	for (j = 0; j < n; j += 8) {
>
> +		RTE_MBUF_LOCK(pkts_burst[j]);
> +		RTE_MBUF_LOCK(pkts_burst[j+1]);
> +		RTE_MBUF_LOCK(pkts_burst[j+2]);
> +		RTE_MBUF_LOCK(pkts_burst[j+3]);
> +		RTE_MBUF_LOCK(pkts_burst[j+4]);
> +		RTE_MBUF_LOCK(pkts_burst[j+5]);
> +		RTE_MBUF_LOCK(pkts_burst[j+6]);
> +		RTE_MBUF_LOCK(pkts_burst[j+7]);
> +
>  		uint32_t pkt_type =
>  			pkts_burst[j]->packet_type &
>  			pkts_burst[j+1]->packet_type &
> @@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  		}
>  	}
>
> -	for (; j < nb_rx; j++)
> +	for (; j < nb_rx; j++) {
> +		RTE_MBUF_LOCK(pkts_burst[j]);
>  		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> +	}
>
>  	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
>
> diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> index 1afa1f0..2938558 100644
> --- a/examples/l3fwd/l3fwd_sse.h
> +++ b/examples/l3fwd/l3fwd_sse.h
> @@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
>
>  	len = qconf->tx_mbufs[port].len;
>
> +	for (j = 0; j < num; ++j)
> +		RTE_MBUF_UNLOCK(m[j]);
> +
>  	/*
>  	 * If TX buffer for that queue is empty, and we have enough packets,
>  	 * then send them straightway.
> @@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
>  		if (likely(pn != BAD_PORT))
>  			send_packetsx4(qconf, pn, pkts_burst + j, k);
>  		else
> -			for (m = j; m != j + k; m++)
> +			for (m = j; m != j + k; m++) {
> +				RTE_MBUF_UNLOCK(pkts_burst[m]);
>  				rte_pktmbuf_free(pkts_burst[m]);
> +			}
>
>  	}
> }
> ==========================================
>
> Note that a data race may or may not have visible consequences. If two
> cores unknowingly process the same mbuf at different times, they may never
> notice it; but if two cores access the same mbuf at the same physical
> time, the consequence is usually visible (a crash). We do not seek a
> solution that captures even potential data races; instead, we seek a
> solution that can capture data races that happen simultaneously on two or
> more cores. Therefore, we do not need to extend the borders of locking as
> wide as possible. We will apply locking only when we are focusing on
> processing a single mbuf.
>
> In a real-life application, a core will spend quite some time on each
> mbuf. The interval ranges from a few hundred cycles to a few thousand
> cycles, and usually not more than a handful of mbufs are involved. This is
> an ideal use case for locking mbufs.
>
> I agree that the race detection shall not be compiled in by default, since
> mtod is often called, and every mtod implies a visit to the local cache.
> Further, recursive calls of LOCK/UNLOCK shall be supported as well. But I
> don't think the refcnt logic should be taken into account; these two are
> orthogonal designs, IMO. ***** Pls correct me if I am wrong here. *****
>
> Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason. That
> is why I said LOCK/UNLOCK needs to survive mbuf alloc initialization.
>
> Of course we need to support locking multiple mbufs at the same time. For
> each core, we will then reserve, say, 8 slots. It works exactly like a
> direct-mapped cacheline.
That is, we can use 4 bits
> from the mbuf address to locate its cacheline. If the cacheline is already
> occupied, we do an eviction; that is, the new mbuf takes the place of the
> old one. The old one is then UNLOCKed, unfortunately.
>
> Honestly, I have not yet tried this approach in a real-life application.
> But I have been thinking over the problem of data race detection for a
> long time, and I found that the restrictions and requirements make this
> solution the only viable one. There are hundreds of papers published in
> the field of data race detection, but the lightest of the options has at
> least a 5x performance penalty, not to mention the space complexity,
> making them inapplicable in practice.
>
> Again, pls, anyone who has had the same painful experience of data race
> bugs, share your concerns with me. It would be nice to come up with some
> practical device to address this problem. I believe 6WIND and other IP
> stack vendors must share the same feeling and opinion.
>
> thx &
> rgds,
>
> Qinglai
>
>
>
> On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil <
> adrien.mazarg...@6wind.com> wrote:
>
> > Hi Qinglai,
> >
> > On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > > Hi,
> > >
> > > I have a proposal for mbuf race condition detection and I would like
> > > to get your opinions before committing any patch.
> > >
> > > A race condition is the worst bug I can think of, as it causes a crash
> > > long after the first crime scene, and the consequences are
> > > unpredictable and difficult to understand.
> > >
> > > Besides, race conditions are very difficult to reproduce. Usually we
> > > get a coredump from a live network, but the coredump itself delivers
> > > nearly zero info. We just know that the mbuf is somehow broken, and
> > > that it was perhaps overwritten by another core at some moment.
> > >
> > > There are tools such as Valgrind and ThreadSanitizer to capture this
> > > fault.
> > > But the overhead brought by these tools is too high for them to be
> > > deployed in performance testing, not to mention in a live network. And
> > > race conditions rarely happen under low pressure.
> > >
> > > Even worse, even in theory, the tools mentioned are not capable of
> > > capturing this scenario, because they require explicit calls to
> > > locking primitives such as pthread mutexes. The semantics are not
> > > understood by the tools without explicit lock/unlock.
> > >
> > > With such known restrictions, I have a proposal roughly as below.
> > >
> > > The idea is to ask the coder to do explicit lock/unlock on each mbuf
> > > that matters. The pseudo code is as below:
> > >
> > > RTE_MBUF_LOCK(m);
> > > /* use mbuf as usual */
> > > ...
> > > RTE_MBUF_UNLOCK(m);
> > >
> > > During the protected critical section, only the core which holds the
> > > lock can access the mbuf; other cores, if they dare to use the mbuf,
> > > will be punished by a segfault.
> > >
> > > Since the proposal shall be feasible at least in performance testing,
> > > the overhead of locking and punishment must be small, and access to
> > > the mbuf must be as usual.
> > >
> > > Hence, RTE_MBUF_LOCK actually does the following (pseudo code):
> > >
> > > RTE_MBUF_LOCK(m)
> > > {
> > >     store_per_core_cache(m, m->buf_addr);
> > >     m->buf_addr = NULL;
> > >     mb(); /* memory barrier */
> > > }
> > >
> > > And RTE_MBUF_UNLOCK is simply the reverse (loading the saved address
> > > back before purging the cache entry):
> > >
> > > RTE_MBUF_UNLOCK(m)
> > > {
> > >     m->buf_addr = load_per_core_cache(m);
> > >     purge_per_core_cache(m);
> > >     mb();
> > > }
> > >
> > > As we can see, the idea is to overwrite m->buf_addr with NULL and
> > > store the original in a per-core cache. The other cores, while trying
> > > to access the mbuf during the critical section, will certainly be
> > > punished.
> > >
> > > Then of course we need to keep the owner core working.
This is done by
> > > making changes to mtod, as below:
> > >
> > > #define rte_pktmbuf_mtod_offset(m, t, o) \
> > >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + \
> > >         (m)->data_off + (o)))
> > >
> > > The per-core cache of mbufs works like a key-value database with
> > > direct mapping. If an eviction happens (a newly arriving mbuf must
> > > kick out an old one), then we restore the buf_addr of the evicted
> > > mbuf. Of course RTE_MBUF_UNLOCK will then take care of such a
> > > situation.
> > >
> > > Note that we expect LOCK/UNLOCK to work even when the mbuf is freed
> > > and allocated, since we don't trust the refcnt. A double-free is
> > > actually very rare, but a data race can occur without a double-free.
> > > Therefore, we need to survive the allocation of the mbuf, that is,
> > > rte_pktmbuf_init, which resets buf_addr.
> > >
> > > Further, other dereferences of buf_addr in rte_mbuf.h need to be
> > > revised. But it is not a big deal (I hope).
> > >
> > > The runtime overhead of this implementation is very lightweight, and
> > > it can probably be turned on even in a live network. And it is not
> > > intrusive at all: no change is needed in struct rte_mbuf; we just need
> > > O(N) space complexity (where N is the number of cores) and O(1)
> > > runtime complexity.
> > >
> > > Alright, that is basically what is in my mind. Before any patch is
> > > provided, or any perf analysis is made, I would like to know what the
> > > risks are from a design point of view. Please leave your feedback.
> >
> > Your proposal makes sense but I'm not sure where developers should call
> > RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user
> > applications only? Is it to be used internally by DPDK as well? Does it
> > include PMDs?
> >
> > I think any overhead outside of debugging mode, as minimal as it is, is
> > too much of a "punishment" for well-behaving applications.
The cost of a memory
> > barrier can be extremely high, and this is one of the reasons DPDK
> > processes mbufs in bulk every time it gets the chance. RTE_MBUF_LOCK()
> > and RTE_MBUF_UNLOCK() should thus exist under some compilation option
> > disabled by default, and be thought of as an additional debugging tool.
> >
> > Also, the implementation would probably be more expensive than your
> > suggestion; in my opinion the reference count must be taken into account
> > and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> > work. Freeing mbufs multiple times is perfectly valid in many cases.
> >
> > How can one lock several mbufs at once, by the way?
> >
> > Since it affects performance, this can only make sense as a debugging
> > tool developers can use when they encounter some corruption issue they
> > cannot identify, somewhat like poisoning buffers before they are freed.
> > Not sure it's worth the trouble.
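[The scheme discussed above — stash buf_addr in a per-core direct-mapped slot table, poison it with NULL, and route the owner's mtod through the table — can be sketched in a few dozen lines. This is a hypothetical illustration, not DPDK code: mbuf_lock, mbuf_unlock, mbuf_mtod, slot_idx, and NSLOTS are invented names, and a two-field struct stands in for struct rte_mbuf.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 16                /* direct-mapped: 4 bits of the address */

struct mbuf {                    /* simplified stand-in for struct rte_mbuf */
	void *buf_addr;
	uint16_t data_off;
};

struct slot {                    /* one per-core cache entry */
	struct mbuf *m;
	void *saved;             /* original buf_addr while locked */
};

static __thread struct slot cache[NSLOTS];   /* the per-core cache */

static unsigned
slot_idx(const struct mbuf *m)
{
	/* skip low alignment bits, take 4 bits as the direct-map index */
	return ((uintptr_t)m >> 7) & (NSLOTS - 1);
}

static void
mbuf_lock(struct mbuf *m)
{
	struct slot *s = &cache[slot_idx(m)];

	if (s->m != NULL && s->m != m)
		s->m->buf_addr = s->saved;   /* evict: old mbuf gets unlocked */
	s->m = m;
	s->saved = m->buf_addr;
	m->buf_addr = NULL;                  /* other cores now fault on use */
	__sync_synchronize();                /* memory barrier, as in mb() */
}

static void
mbuf_unlock(struct mbuf *m)
{
	struct slot *s = &cache[slot_idx(m)];

	if (s->m == m) {
		m->buf_addr = s->saved;      /* restore the real pointer */
		s->m = NULL;
	}
	__sync_synchronize();
}

/* owner-core access: read the stashed pointer instead of buf_addr */
static void *
mbuf_mtod(struct mbuf *m)
{
	struct slot *s = &cache[slot_idx(m)];
	char *base = (s->m == m) ? (char *)s->saved : (char *)m->buf_addr;

	return base + m->data_off;
}
```

[A locked mbuf keeps working for its owner through mbuf_mtod(), while any other core dereferencing buf_addr sees NULL, which is the "punishment" the proposal relies on.]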
I agree with Adrien here - it doesn't look worth it:
From one side, it adds quite a lot of overhead and complexity to the mbuf
code. From the other side, its usage seems pretty limited - there would be
a lot of cases where it wouldn't be able to detect the race condition
anyway. So my opinion is NACK.

BTW, if your app is intentionally trying to read/write the same mbufs
simultaneously from different threads without some explicit
synchronization, then most likely there is something wrong with it.

Konstantin
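[Adrien's point earlier in the thread that RTE_MBUF_LOCK() must be recursive could be met without involving the refcnt: give each per-core slot a nesting depth, and restore buf_addr only when the outermost unlock runs. Again a hypothetical sketch with invented names, reusing a simplified stand-in for struct rte_mbuf rather than the real one.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 16

struct mbuf {                    /* simplified stand-in, as before */
	void *buf_addr;
	uint16_t data_off;
};

struct slot {
	struct mbuf *m;
	void *saved;
	unsigned depth;          /* nesting level of recursive locks */
};

static __thread struct slot cache[NSLOTS];

static unsigned
slot_idx(const struct mbuf *m)
{
	return ((uintptr_t)m >> 7) & (NSLOTS - 1);
}

static void
mbuf_lock_recursive(struct mbuf *m)
{
	struct slot *s = &cache[slot_idx(m)];

	if (s->m == m) {                     /* already held: just nest */
		s->depth++;
		return;
	}
	if (s->m != NULL)
		s->m->buf_addr = s->saved;   /* evict the previous occupant */
	s->m = m;
	s->saved = m->buf_addr;
	s->depth = 1;
	m->buf_addr = NULL;
	__sync_synchronize();
}

static void
mbuf_unlock_recursive(struct mbuf *m)
{
	struct slot *s = &cache[slot_idx(m)];

	if (s->m != m)
		return;                      /* not held by this core */
	if (--s->depth == 0) {
		m->buf_addr = s->saved;      /* outermost unlock restores */
		s->m = NULL;
	}
	__sync_synchronize();
}
```

[The depth counter keeps nested lock/unlock pairs balanced on the owner core; eviction still fully unlocks the displaced mbuf, matching the direct-mapped cache behaviour described in the thread.]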