Hello Mateusz,

On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote:
> On Sun, May 24, 2026 at 4:30 PM Breno Leitao <[email protected]> wrote:
> >
> > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > > >                    * after waiting we need to re-check whether the pipe
> > > >                    * become empty while we dropped the lock.
> > > >                    */
> > > > +                 anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > > >                   mutex_unlock(&pipe->mutex);
> > > > +                 anon_pipe_free_pages(&prealloc);
> > >
> > > Do we really want to call anon_pipe_free_pages() at this point?
> > >
> > > The main loop will continue when pipe_writable() becomes true again...
> >
> > I went back and forth on this. The argument for freeing was that
> > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> > stopped reader), and holding up the prealloc pages felt antisocial --
> > especially under the memory pressure this series targets, where those
> > pages are more useful on the freelists than parked on a sleeping task.
> >
> > On the other side, on wakeup the loop is guaranteed to want pages again,
> > and re-entering the allocator under the mutex puts us back in the
> > contended state the patch removes. For any write() large enough to wait
> > mid-syscall (which is the workload patch 2/2 measures), keeping them
> > strictly wins on throughput / p99.
> >
>
> You can still prealloc after wakeup for whatever reminder you got
> though, but I can agree dropping these frees is a sensible way out and
> it is easier and I'm not going to insist on one way or the other.

Ack. I've sent a v3 with anon_pipe_free_pages() and
anon_pipe_refill_tmp_pages() dropped.

> However, I think it would be prudent to add a tracepoint to some
> machines on your fleet to find out how often they allocate pages under
> the mutex (and for what i/o size). Initial alloc for the first write <
> PAGE_SIZE definitely happens under the mutex which is probably not a
> problem, but for anything later?
> The tracepoint can have a trivial
> indicator if this is the first write if that matters. One can

Isn't this what I've reported earlier?

https://lore.kernel.org/all/[email protected]/

Adding a tracepoint is harder than usual, given kernel rollout takes ages.
But I hacked a bpftrace script and ran it on a random sample of fleet
hosts (5 min each).

As reported earlier, multi-page pipe writes are not uncommon: on one host
a single long-running process produced 196,476 under-mutex alloc_page()
calls in 5 minutes, with allocs-per-write distributions reaching 16+ --
exactly the pattern this patch removes.

Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page
first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips
anyway, so the win is concentrated on the workloads that actually need it.

None of the allocs hit reclaim during the trace I ran, but I would expect
direct reclaim to happen with the lock held.

Thanks for the review and direction,
--breno

