While profiling kernel code on Meta's caching code[1], I found some lock
contention on pipe hot path, and I decide to investigate, and found that
anon_pipe_write() currently calls alloc_page() once per page while
holding pipe->mutex. The allocation can sleep doing direct reclaim and
runs memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex.

I would like to propose that we bulk pre-allocate pages outside
pipe->mutex in anon_pipe_write -- pre-allocates up to PIPE_PREALLOC_MAX
(8) pages with alloc_pages_bulk() before taking the lock

 * the get-page helper then consumes from tmp_page[] / the prealloc
   array first and only falls back to in-lock alloc_page() for writes
   larger than PIPE_PREALLOC_MAX (8) pages, which is rare.

I also vibe coded a microbenchmark to test the pipe benchmark. Basically
it wweeps writers x readers over {1,2,5} x {1,5,10} with 64KB writes
against a 1 MB pipe and prints throughput + latency percentiles per
config.

Measured on arm64 in virtme-ng (16 vCPUs, 64KB writes, 1 MB pipe).

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  
50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 
103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 
252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  
52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 
101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 
250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <[email protected]>
---
Breno Leitao (2):
      fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 |  40 +++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 351 ++++++++++++++++++++++++++++++
 5 files changed, 400 insertions(+), 2 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <[email protected]>


Reply via email to