Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock

Mateusz Guzik Wed, 17 Jun 2026 05:10:01 -0700

On Wed, Jun 17, 2026 at 12:24 PM Breno Leitao wrote:
>
> On Wed, Jun 17, 2026 at 10:52:40AM +0200, Oleg Nesterov wrote:
> > On 06/16, Josh Triplett wrote:
> > >
> > > On Sun, May 24, 2026 at 07:44:57AM -0700, Breno Leitao wrote:
> > > > This series pre-allocates pages outside pipe->mutex in
> > > > anon_pipe_write(): for writes that span more than one full page, up
> > > > to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> > > > alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> > > > then drains the prealloc array first, falls back to the per-pipe
> > > > tmp_page[] cache, and only enters the allocator under the mutex for
> > > > the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> > > > writes that skip prealloc, or shortfalls when the prealloc loop
> > > > fails). Leftover prealloc pages are recycled into tmp_page[] before
> > > > unlock and any remainder is put_page()'d after unlock, keeping the
> > > > allocator out of the critical section on both sides.
> > > [...]
> > > > I also vibe-coded a microbenchmark to validate the change. It sweeps
> > > > writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> > > > 1 MB pipe and prints throughput + latency percentiles per config.
> > >
> > > How do the numbers compare with 1-byte writes/reads? (It's fine if
> > > they're not *faster*, just want to make sure they don't get any
> > > *worse*. This case comes up a lot with pipes used for synchronization or
> > > event reporting, such as with make.)
> >
> > Note the "for writes that span more than one full page" above. Pre-allocate
> > does nothing if total_len <= PAGE_SIZE.
>
> Exactly.
>
>
> The pre-allocation only triggers for multi-page writes:
>
> anon_pipe_get_page_prealloc() returns immediately when total_len <= PAGE_SIZE,
> so a 1-byte (or any sub-page) write never enters the new path.
>
> anon_pipe_get_page() then falls through to the existing tmp_page/alloc_page
> logic exactly as before; the only added cost is one length check and a NULL
> prealloc pop, both trivially predicted.
>
> Measured it to _just be sure_, 1-byte ping-pong (perf bench sched pipe -s 1):
>
>     baseline:  2.674 usecs/op
>     patched:   2.710 usecs/op   (+1.3%, within run-to-run noise)
>
> --breno
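
For readers following along, the scheme described in the quoted cover
letter can be modelled in user space roughly as below. This is only a
sketch under stated assumptions, not the code from the series: malloc()
stands in for alloc_page(), a pthread mutex stands in for pipe->mutex,
the per-pipe tmp_page[] cache is left out, and every name except
PIPE_PREALLOC_MAX is hypothetical. The number of pages pre-allocated
(the page count of the write, capped at 8) is also an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u          /* assumed page size for the model */
#define PIPE_PREALLOC_MAX 8      /* cap quoted in the cover letter */

/* Hypothetical per-call array of pages allocated before locking. */
struct prealloc {
    void *pages[PIPE_PREALLOC_MAX];
    int nr;
};

/* Hypothetical stand-in for the pipe: a lock plus a counter for the test. */
struct model_pipe {
    pthread_mutex_t mutex;
    int locked_allocs;           /* allocations made while holding the mutex */
};

/* Runs before the mutex is taken; does nothing for writes of at most one page. */
static void prealloc_fill(struct prealloc *pa, size_t total_len)
{
    size_t want;

    pa->nr = 0;
    if (total_len <= PAGE_SIZE)
        return;
    want = (total_len + PAGE_SIZE - 1) / PAGE_SIZE;
    if (want > PIPE_PREALLOC_MAX)
        want = PIPE_PREALLOC_MAX;
    while ((size_t)pa->nr < want) {
        void *p = malloc(PAGE_SIZE);
        if (!p)
            break;               /* shortfall: the locked path covers the rest */
        pa->pages[pa->nr++] = p;
    }
}

/* Called with the mutex held: drain the prealloc array, else allocate. */
static void *get_page(struct model_pipe *pipe, struct prealloc *pa)
{
    if (pa->nr > 0)
        return pa->pages[--pa->nr];
    pipe->locked_allocs++;
    return malloc(PAGE_SIZE);
}

/* Returns the number of pages the write consumed. */
static int model_write(struct model_pipe *pipe, size_t total_len)
{
    struct prealloc pa;
    size_t need = (total_len + PAGE_SIZE - 1) / PAGE_SIZE;
    int used = 0;

    prealloc_fill(&pa, total_len);

    pthread_mutex_lock(&pipe->mutex);
    for (size_t i = 0; i < need; i++) {
        void *page = get_page(pipe, &pa);
        if (!page)
            break;
        free(page);              /* a real pipe would install it in a buffer */
        used++;
    }
    pthread_mutex_unlock(&pipe->mutex);

    while (pa.nr > 0)            /* leftovers are released after unlock */
        free(pa.pages[--pa.nr]);
    return used;
}
```

In this model a 1-byte write takes the early return in prealloc_fill()
and allocates under the lock exactly as it would without the array,
while a 64KB write gets its first 8 pages from outside the lock.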


There are trivial touch-ups which can be done here by adding a bunch of
likely()/unlikely() predicts and inlining kill_fasync(), if someone can
be bothered.
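
To illustrate the kind of touch-up meant, here is a user-space sketch,
assuming GCC or Clang for __builtin_expect(). The likely()/unlikely()
macros mirror the usual kernel definitions; the model_* names are
hypothetical and this is not the kernel's kill_fasync(). The idea shown
is that the NULL check is inlined at the call site and annotated as
unlikely, so the common case with no async listener registered pays for
a predicted branch and no function call.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints in the style of the kernel's likely()/unlikely(). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for an async-notification listener. */
struct model_fasync {
    int signals_sent;
};

/* Out-of-line slow path: only reached when a listener is registered. */
static __attribute__((noinline)) void
model_kill_fasync_slow(struct model_fasync *fa)
{
    fa->signals_sent++;
}

/* Inline wrapper: the cheap NULL check stays at the call site. */
static inline void model_kill_fasync(struct model_fasync **fp)
{
    if (unlikely(*fp != NULL))
        model_kill_fasync_slow(*fp);
}
```

Whether this buys anything measurable on the 1-byte path is something
only a benchmark run would show.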

Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
