On Wed, Jun 17, 2026 at 12:24 PM Breno Leitao <[email protected]> wrote:
>
> On Wed, Jun 17, 2026 at 10:52:40AM +0200, Oleg Nesterov wrote:
> > On 06/16, Josh Triplett wrote:
> > >
> > > On Sun, May 24, 2026 at 07:44:57AM -0700, Breno Leitao wrote:
> > > > This series pre-allocates pages outside pipe->mutex in
> > > > anon_pipe_write(): for writes that span more than one full page, up
> > > > to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> > > > alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> > > > then drains the prealloc array first, falls back to the per-pipe
> > > > tmp_page[] cache, and only enters the allocator under the mutex for
> > > > the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> > > > writes that skip prealloc, or shortfalls when the prealloc loop
> > > > fails). Leftover prealloc pages are recycled into tmp_page[] before
> > > > unlock and any remainder is put_page()'d after unlock, keeping the
> > > > allocator out of the critical section on both sides.
> > > [...]
> > > > I also vibe-coded a microbenchmark to validate the change. It sweeps
> > > > writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> > > > 1 MB pipe and prints throughput + latency percentiles per config.
> > >
> > > How do the numbers compare with 1-byte writes/reads? (It's fine if
> > > they're not *faster*, just want to make sure they don't get any
> > > *worse*. This case comes up a lot with pipes used for synchronization or
> > > event reporting, such as with make.)
> >
> > Note the "for writes that span more than one full page" above. Pre-allocate
> > does nothing if total_len <= PAGE_SIZE.
>
> Exactly.
>
>
> The pre-allocation only triggers for multi-page writes:
>
> anon_pipe_get_page_prealloc() returns immediately when total_len <= PAGE_SIZE,
> so a 1-byte (or any sub-page) write never enters the new path.
>
> anon_pipe_get_page() then falls through to the existing tmp_page/alloc_page
> logic exactly as before; the only added cost is one length check and a NULL
> prealloc pop, both trivially predicted.
>
> Measured it to _just be sure_, 1-byte ping-pong (perf bench sched pipe -s 1):
>
> baseline: 2.674 usecs/op
> patched:  2.710 usecs/op (+1.3%, within run-to-run noise)
>
> --breno

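For anyone skimming the thread, the behaviour described above can be modelled in userspace roughly as follows. This is only a sketch of the ordering as stated in the cover letter, not the patch code: every model_* name is made up for illustration, the 4 KiB page size is an assumption, and rounding the page count up is my guess (the patch may count pages differently).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096u /* assumption: 4 KiB pages */
#define MODEL_PREALLOC_MAX 8u    /* PIPE_PREALLOC_MAX (8) per the cover letter */

/* Hypothetical model: number of pages the prealloc step would try to
 * grab before pipe->mutex is taken, for a write of total_len bytes. */
static unsigned int model_prealloc_count(size_t total_len)
{
	size_t pages;

	/* Writes of at most one page skip prealloc entirely. */
	if (total_len <= MODEL_PAGE_SIZE)
		return 0;

	/* Assumption: round up to whole pages, then cap at the maximum. */
	pages = (total_len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	return pages > MODEL_PREALLOC_MAX ? MODEL_PREALLOC_MAX
					  : (unsigned int)pages;
}

/* Where a page comes from, in the order the cover letter describes. */
enum model_src { SRC_PREALLOC, SRC_TMP_PAGE, SRC_ALLOCATOR };

struct model_pipe {
	unsigned int nr_prealloc; /* pages left in the prealloc array */
	unsigned int nr_tmp;      /* pages cached in tmp_page[] */
};

/* Hypothetical model of the page lookup done under the mutex. */
static enum model_src model_get_page(struct model_pipe *p)
{
	assert(p->nr_prealloc <= MODEL_PREALLOC_MAX);

	if (p->nr_prealloc) {          /* 1. drain prealloc first */
		p->nr_prealloc--;
		return SRC_PREALLOC;
	}
	if (p->nr_tmp) {               /* 2. then the per-pipe cache */
		p->nr_tmp--;
		return SRC_TMP_PAGE;
	}
	return SRC_ALLOCATOR;          /* 3. only then the allocator */
}
```

With this model a 1-byte write gets a prealloc count of 0 and goes straight to step 2 or 3, which matches the claim that the sub-page case only pays for one length check and an empty prealloc pop.
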
There are trivial touch-ups that could still be made, such as adding a handful of predicts (likely()/unlikely() annotations) and inlining kill_fasync(), if someone can be bothered.
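
To make the suggestion concrete, here is a userspace sketch of the general pattern, not a patch. It assumes the point of inlining is to keep the "nobody registered for async notification" check in the caller, so the common case avoids a function call. The likely()/unlikely() macros have the same shape as the kernel's; every demo_* name is hypothetical and only stands in for the real fasync code.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the kernel's branch hints, redefined here so the
 * sketch builds in userspace with GCC or Clang. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for struct fasync_struct. */
struct demo_fasync {
	int fd;
};

static int demo_signals_sent;

/* Out-of-line slow path: only reached when a listener is registered. */
static void demo_kill_fasync_slow(struct demo_fasync **fp)
{
	(void)fp;
	demo_signals_sent++;
}

/* Inline wrapper: the NULL check stays in the caller and is marked
 * unlikely, so the no-listener case costs one predicted branch. */
static inline void demo_kill_fasync(struct demo_fasync **fp)
{
	assert(fp != NULL);

	if (unlikely(*fp != NULL))
		demo_kill_fasync_slow(fp);
}
```

Whether this is measurable on the pipe paths is a separate question; the hints only affect code layout and the wrapper only saves a call in the no-listener case.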

