On Mon, Sep 29, 2025 at 6:41 AM Xu Kuohai <[email protected]> wrote:
>
> On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:
> > On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai <[email protected]> wrote:
> >>
> >> From: Xu Kuohai <[email protected]>
> >>
> >> When the bpf ring buffer is full, new events can not be recorded util
> >
> > typo: until
> >
>
> ACK
>
> >> the consumer consumes some events to free space. This may cause critical
> >> events to be discarded, such as in fault diagnostic, where recent events
> >> are more critical than older ones.
> >>
> >> So add ovewrite mode for bpf ring buffer. In this mode, the new event
> >
> > overwrite, BPF
> >
>
> ACK
>
> >> overwrites the oldest event when the buffer is full.
> >>
> >> The scheme is as follows:
> >>
> >> 1. producer_pos tracks the next position to write new data. When there
> >>    is enough free space, producer simply moves producer_pos forward to
> >>    make space for the new event.
> >>
> >> 2. To avoid waiting for consumer to free space when the buffer is full,
> >>    a new variable overwrite_pos is introduced for producer. overwrite_pos
> >>    tracks the next event to be overwritten (the oldest event committed)
> >>    in the buffer. producer moves it forward to discard the oldest events
> >>    when the buffer is full.
> >>
> >> 3. pending_pos tracks the oldest event under committing. producer ensures
> >
> > "under committing" is confusing. Oldest event to be committed?
> >
>
> Yes, 'the oldest event to be committed'. Thanks!
>
> >>    producers_pos never passes pending_pos when making space for new
> >>    events. So multiple producers never write to the same position at the
> >>    same time.
> >>
> >> 4. producer wakes up consumer every half a round ahead to give it a chance
> >>    to retrieve data. However, for an overwrite-mode ring buffer, users
> >>    typically only cares about the ring buffer snapshot before a fault
> >>    occurs. In this case, the producer should commit data with
> >>    BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.
> >>
> >> To make it clear, here are some example diagrams.
> >>
> >> 1. Let's say we have a ring buffer with size 4096.
> >>
> >>    At first, {producer,overwrite,pending,consumer}_pos are all set to 0
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |                                                               |
> >>    |                                                               |
> >>    |                                                               |
> >>    +---------------------------------------------------------------+
> >>    ^
> >>    |
> >>    producer_pos = 0
> >>    overwrite_pos = 0
> >>    pending_pos = 0
> >>    consumer_pos = 0
> >>
> >> 2. Reserve event A, size 512.
> >>
> >>    There is enough free space, so A is allocated at offset 0 and
> >>    producer_pos is moved to 512, the end of A. Since A is not submitted,
> >>    the BUSY bit is set.
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |       |                                                       |
> >>    |   A   |                                                       |
> >>    | [BUSY]|                                                       |
> >>    +---------------------------------------------------------------+
> >>    ^       ^
> >>    |       |
> >>    |       producer_pos = 512
> >>    overwrite_pos = 0
> >>    pending_pos = 0
> >>    consumer_pos = 0
> >>
> >> 3. Reserve event B, size 1024.
> >>
> >>    B is allocated at offset 512 with BUSY bit set, and producer_pos is
> >>    moved to the end of B.
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |       |               |                                       |
> >>    |   A   |       B       |                                       |
> >>    | [BUSY]|    [BUSY]     |                                       |
> >>    +---------------------------------------------------------------+
> >>    ^                       ^
> >>    |                       |
> >>    |                       producer_pos = 1536
> >>    overwrite_pos = 0
> >>    pending_pos = 0
> >>    consumer_pos = 0
> >>
> >> 4. Reserve event C, size 2048.
> >>
> >>    C is allocated at offset 1536 and producer_pos becomes 3584.
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |       |               |                               |       |
> >>    |   A   |       B       |               C               |       |
> >>    | [BUSY]|    [BUSY]     |            [BUSY]             |       |
> >>    +---------------------------------------------------------------+
> >>    ^                                                       ^
> >>    |                                                       |
> >>    |                                                       producer_pos = 3584
> >>    overwrite_pos = 0
> >>    pending_pos = 0
> >>    consumer_pos = 0
> >>
> >> 5. Submit event A.
> >>
> >>    The BUSY bit of A is cleared. B becomes the oldest event under
> >>    writing, so
> >
> > Now it's "under writing" :) To be committed? Or "pending committing"
> > or just "pending", I guess. But not under anything, it just confuses
> > readers. IMO.
> >
>
> Once again, 'oldest event to be committed'.
>
> I should check it with an AI agent first.
>
> >>    pending_pos is moved to 512, the start of B.
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |       |               |                               |       |
> >>    |   A   |       B       |               C               |       |
> >>    |       |    [BUSY]     |            [BUSY]             |       |
> >>    +---------------------------------------------------------------+
> >>    ^       ^                                               ^
> >>    |       |                                               |
> >>    |       |                                               producer_pos = 3584
> >>    |       pending_pos = 512
> >>    overwrite_pos = 0
> >>    consumer_pos = 0
> >>
> >> 6. Submit event B.
> >>
> >>    The BUSY bit of B is cleared, and pending_pos is moved to the start
> >>    of C, which is the oldest event under writing now.
> >
> > ditto
> >
>
> Again and again :(
>
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |       |               |                               |       |
> >>    |   A   |       B       |               C               |       |
> >>    |       |               |            [BUSY]             |       |
> >>    +---------------------------------------------------------------+
> >>    ^                       ^                               ^
> >>    |                       |                               |
> >>    |                       |                               producer_pos = 3584
> >>    |                       pending_pos = 1536
> >>    overwrite_pos = 0
> >>    consumer_pos = 0
> >>
> >> 7. Reserve event D, size 1536 (3 * 512).
> >>
> >>    There are 2048 bytes not under writing between producer_pos and
> >>    pending_pos, so D is allocated at offset 3584, and producer_pos is
> >>    moved from 3584 to 5120.
> >>
> >>    Since event D will overwrite all bytes of event A and the begining
> >>    512 bytes
> >
> > typo: beginning, but really "first 512 bytes" would be clearer
> >
>
> OK, I’ll switch to 'first 512 bytes' for clarity.
>
> >>    of event B, overwrite_pos is moved to the start of event C, the
> >>    oldest event that is not overwritten.
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |               |       |                               |       |
> >>    |     D End     |       |               C               |D Begin|
> >>    |    [BUSY]     |       |            [BUSY]             | [BUSY]|
> >>    +---------------------------------------------------------------+
> >>    ^               ^       ^
> >>    |               |       |
> >>    |               |       pending_pos = 1536
> >>    |               |       overwrite_pos = 1536
> >>    |               producer_pos = 5120
> >>    consumer_pos = 0
> >>
> >> 8. Reserve event E, size 1024.
> >>
> >>    Though there are 512 bytes not under writing between producer_pos and
> >>    pending_pos, E can not be reserved, as it would overwrite the first
> >>    512 bytes of event C, which is still under writing.
> >>
> >> 9. Submit event C and D.
> >>
> >>    pending_pos is moved to the end of D.
> >>
> >>    0       512     1024    1536    2048    2560    3072    3584    4096
> >>    +---------------------------------------------------------------+
> >>    |               |       |                               |       |
> >>    |     D End     |       |               C               |D Begin|
> >>    |               |       |                               |       |
> >>    +---------------------------------------------------------------+
> >>    ^               ^       ^
> >>    |               |       |
> >>    |               |       overwrite_pos = 1536
> >>    |               producer_pos = 5120
> >>    |               pending_pos = 5120
> >>    consumer_pos = 0
> >>
> >> The performance data for overwrite mode will be provided in a follow-up
> >> patch that adds overwrite mode benchs.
> >>
> >> A sample of performance data for non-overwrite mode on an x86_64 and arm64
> >> CPU, before and after this patch, is shown below. As we can see, no obvious
> >> performance regression occurs.
> >>
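
Btw, just to double-check my mental model of the bookkeeping described
above, these are the invariants I took away from the walkthrough
(userspace-style sketch with made-up names, not code from the patch):

#include <assert.h>

/* hypothetical snapshot of the four positions, for illustration only */
struct rb_positions {
        unsigned long consumer_pos;   /* next record the consumer reads        */
        unsigned long overwrite_pos;  /* oldest committed, not yet overwritten */
        unsigned long pending_pos;    /* oldest record still being committed   */
        unsigned long producer_pos;   /* next position to reserve              */
};

static void check_invariants(const struct rb_positions *p, unsigned long size)
{
        /* positions only move forward and keep this ordering */
        assert(p->overwrite_pos <= p->pending_pos);
        assert(p->pending_pos <= p->producer_pos);
        /* producer never passes the oldest record still being committed,
         * so two producers never write the same bytes concurrently
         */
        assert(p->producer_pos - p->pending_pos <= size);
        /* data that hasn't been overwritten always fits in the buffer */
        assert(p->producer_pos - p->overwrite_pos <= size);
        /* consumer_pos is deliberately unconstrained: in overwrite mode it
         * may lag behind overwrite_pos, unlike in the existing mode
         */
        (void)p->consumer_pos;
}

If any of these doesn't actually hold, it's probably worth spelling that
out in the commit message.
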
[...]

> >> Signed-off-by: Xu Kuohai <[email protected]>
> >> ---
> >>  include/uapi/linux/bpf.h       |   4 +
> >>  kernel/bpf/ringbuf.c           | 159 +++++++++++++++++++++++++++------
> >>  tools/include/uapi/linux/bpf.h |   4 +
> >>  3 files changed, 141 insertions(+), 26 deletions(-)
> >> [...]
> >> +
> >> +        over_pos = READ_ONCE(rb->overwrite_pos);
> >> +        return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
> >
> > I'm trying to understand why you need to min with `rb->mask + 1`, can
> > you please elaborate?
> >
>
> We need the min because rb->producer_pos and rb->overwrite_pos are read
> at different times. During this gap, a fast producer may wrap once or
> more, making over_pos larger than prod_pos.
>

what if you read overwrite_pos before reading producer_pos? Then it
can't be larger than producer_pos and available data would be
producer_pos - max(consumer_pos, overwrite_pos)? would that work?

> > And also, at least for consistency, use smp_load_acquire() for
> > overwrite_pos?
> >
>
> Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(),
> where overwrite_pos is WRITE_ONCE first, followed by
> smp_store_release(producer_pos). So here we do
> smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos)
> to ensure a consistent view of the ring buffer.
>
> For consistency when reading consumer_pos and producer_pos, I’m fine with
> switching READ_ONCE to smp_load_acquire for overwrite_pos.
>

I'm not sure it matters much, but this function is called outside of
rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while
doing that WRITE_ONCE(). So it might not make any difference, but I
have mild preference for smp_load_acquire() here.

> >>  }
> >>
> >>  static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
> >> @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
> >>          return (void*)((addr & PAGE_MASK) - off);
> >>  }
> >>
> >> + [...]
> >> +        /* In overwrite mode, move overwrite_pos to the next record to be
> >> +         * overwritten if the ring buffer is full
> >> +         */
> >
> > hm... here I think the important point is that we search for the next
> > record boundary until which we need to overwrite data such that it
> > fits newly reserved record. "next record to be overwritten" isn't that
> > important (we might never need to overwrite it). Important are those
> > aspects of a) staying on record boundary and b) consuming enough
> > records to reserve the new one.
> >
> > Can you please update the comment to mention the above points?
> >
>
> Sure, I'll update the comment to:
>
>   In overwrite mode, advance overwrite_pos when the ring buffer is full.
>   The key points are to stay on record boundaries and consume enough
>   records to fit the new one.
>

ok
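
And just so we're talking about the same loop shape, this is roughly how
I picture that advance (toy userspace model, made-up rec_len() accessor,
ignoring how BUSY/DISCARD bits are stored in the length word):

/* rec_len(pos) is assumed to return the full, already 8-byte-aligned
 * length (header included) of the committed record starting at pos
 */
static unsigned long advance_overwrite_pos(unsigned long over_pos,
                                           unsigned long new_prod_pos,
                                           unsigned long rb_size,
                                           unsigned long (*rec_len)(unsigned long))
{
        /* discard whole old records, staying on record boundaries, until
         * everything between over_pos and new_prod_pos fits in rb_size bytes
         */
        while (new_prod_pos - over_pos > rb_size)
                over_pos += rec_len(over_pos);
        return over_pos;
}

If the real code deviates from that in some important way, the comment
should probably mention it as well.
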

[...]

>
> >> +                                unsigned long rec_pos,
> >> +                                unsigned long cons_pos,
> >> +                                u32 len, u64 flags)
> >> +{
> >> +        unsigned long rec_end;
> >> +
> >> +        if (flags & BPF_RB_FORCE_WAKEUP)
> >> +                return true;
> >> +
> >> +        if (flags & BPF_RB_NO_WAKEUP)
> >> +                return false;
> >> +
> >> +        /* for non-overwrite mode, if consumer caught up and is waiting for
> >> +         * our record, notify about new data availability
> >> +         */
> >> +        if (likely(!rb->overwrite_mode))
> >> +                return cons_pos == rec_pos;
> >> +
> >> +        /* for overwrite mode, to give the consumer a chance to catch up
> >> +         * before being overwritten, wake up consumer every half a round
> >> +         * ahead.
> >> +         */
> >> +        rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
> >> +
> >> +        cons_pos &= (rb->mask >> 1);
> >> +        rec_pos &= (rb->mask >> 1);
> >> +        rec_end &= (rb->mask >> 1);
> >> +
> >> +        if (cons_pos == rec_pos)
> >> +                return true;
> >> +
> >> +        if (rec_pos < cons_pos && cons_pos < rec_end)
> >> +                return true;
> >> +
> >> +        if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
> >> +                return true;
> >> +
> >
> > hm... ok, let's discuss this. Why do we need to do some half-round
> > heuristic for overwrite mode? If a consumer is falling behind it
> > should be actively trying to catch up and they don't need notification
> > (that's the non-overwrite mode logic already).
> >
> > So there is more to this than a brief comment you left, can you please
> > elaborate?
> >
>
> The half-round wakeup was originally intended to work with libbpf in the
> v1 version. In that version, libbpf used a retry loop to safely copy data
> from the ring buffer that hadn’t been overwritten. By waking the consumer
> once every half round, there was always a period where the consumer and
> producer did not overlap, which helped reduce the number of retries.
>

I can't say I completely grok the logic here, but do you think we should
still keep this half-round wakeup? It looks like an arbitrary heuristic,
so I'd rather not have it.

> >
> > pw-bot: cr
> >
> >> +        return false;
> >> +}
> >> +
> >> +static __always_inline
> >
> > we didn't have always_inline before, any strong reason to add it now?
> >
>
> I just wanted to avoid introducing any performance regression. Before this
> patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but
> after the patch it wasn’t, so I added always_inline explicitly to keep it
> inlined.
>

how big of a difference was it in benchmarks? It's generally frowned
upon using __always_inline without a good reason.

[...]
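
p.s. for anyone else following along, here is my reading of the
half-round check above, pulled out into a standalone toy function with
concrete numbers (assuming a 4KB buffer, i.e. mask == 4095; this is an
illustration, not code from the patch):

#include <stdbool.h>

static bool half_round_wakeup(unsigned long cons_pos, unsigned long rec_pos,
                              unsigned long rec_end, unsigned long mask)
{
        unsigned long half = mask >> 1;   /* 2047 for a 4KB buffer */

        /* compare everything modulo half the buffer size */
        cons_pos &= half;
        rec_pos &= half;
        rec_end &= half;

        if (cons_pos == rec_pos)
                return true;
        /* record [rec_pos, rec_end) covers the consumer's position... */
        if (rec_pos < cons_pos && cons_pos < rec_end)
                return true;
        /* ...including the case where the record wraps around the half mark */
        if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
                return true;
        return false;
}

E.g. with cons_pos == 0, a record at [0, 512) or at [1536, 2560) returns
true, while one at [512, 1536) doesn't, so a stalled consumer gets poked
roughly once per half buffer of produced data. Which is why it still
looks like an arbitrary heuristic to me.
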

