On Sun, May 24, 2026 at 4:30 PM Breno Leitao <[email protected]> wrote:
>
> On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > >  		 * after waiting we need to re-check whether the pipe
> > >  		 * become empty while we dropped the lock.
> > >  		 */
> > > +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > >  		mutex_unlock(&pipe->mutex);
> > > +		anon_pipe_free_pages(&prealloc);
> >
> > Do we really want to call anon_pipe_free_pages() at this point?
> >
> > The main loop will continue when pipe_writable() becomes true again...
>
> I went back and forth on this. The argument for freeing was that
> wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> stopped reader), and holding up the prealloc pages felt antisocial --
> especially under the memory pressure this series targets, where those pages
> are more useful on the freelists than parked on a sleeping task.
>
> On the other side, on wakeup the loop is guaranteed to want pages again, and
> re-entering the allocator under the mutex puts us back in the contended state
> the patch removes. For any write() large enough to wait mid-syscall (which is
> the workload patch 2/2 measures), keeping them strictly wins on throughput /
> p99.
>
You can still prealloc after wakeup for whatever remainder you have left,
but I agree that dropping these frees is a sensible way out. It is also the
easier option, and I'm not going to insist on one way or the other.

However, I think it would be prudent to add a tracepoint on some machines
in your fleet to find out how often they allocate pages under the mutex
(and for what I/O size). The initial alloc for the first write < PAGE_SIZE
definitely happens under the mutex, which is probably not a problem, but
what about anything later? The tracepoint can carry a trivial indicator of
whether this is the first write, if that matters.

One can speculate all day, but nothing beats checking what actually happens.
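
For illustration, an untested sketch of what such a tracepoint could look
like. The event name, the field names, the TRACE_SYSTEM name and the header
location are all made up here and are not part of the posted series. How the
caller decides that a write is the "first" one is deliberately left open.

```c
/*
 * Hypothetical trace header, e.g. include/trace/events/pipe.h.
 * All names below are assumptions made for this sketch.
 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM pipe

#if !defined(_TRACE_PIPE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_PIPE_H

#include <linux/tracepoint.h>

/*
 * Meant to be emitted from the spot in anon_pipe_write() that has to
 * allocate a page while pipe->mutex is held.
 */
TRACE_EVENT(anon_pipe_alloc_locked,

	TP_PROTO(size_t total_len, bool first_write),

	TP_ARGS(total_len, first_write),

	TP_STRUCT__entry(
		__field(size_t,	total_len)	/* size of the write() request */
		__field(bool,	first_write)	/* trivial "first write" flag */
	),

	TP_fast_assign(
		__entry->total_len	= total_len;
		__entry->first_write	= first_write;
	),

	TP_printk("total_len=%zu first_write=%d",
		  __entry->total_len, __entry->first_write)
);

#endif /* _TRACE_PIPE_H */

/* This part must be outside protection */
#include <trace/define_trace.h>
```

The call site would then be a single trace_anon_pipe_alloc_locked(...) at
the locked allocation, and exactly one .c file would need to define
CREATE_TRACE_POINTS before including the header. A histogram over total_len
split by first_write should be enough to answer the question.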

