Hello Mateusz,
Thanks for the careful review. Let me address the points inline,
with data I collected on three Meta production hosts: two aarch64 NVIDIA Grace
and one x86_64 Bergamo / EPYC 9D64, using bpftrace and perf lock contention.
On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> This change is quite a hammer.
Are you seeing any downside/trade-off with this current approach?
>
> Do you have a real workload where there are multiple processes issuing
> large ops to the same pipe on both reader and writer side?
>
> The only workload I'm aware of sharing the pipe between more than just
> one reader and one writer is anything with make, but that is mostly
> operating on 1-byte ops.
You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):
size writes %
<128 B 293602 98.0 (merge path, no alloc)
128 B - 4 KB 1108 0.4
4 - 8 KB 2149 0.72 <-- patch start helps from here
8 - 16 KB 3
16 - 32 KB 0
32 - 64 KB 5
64 - 128 KB 2 <-- up to 72KB writes
So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
contention count is a Meta fleet workload spraying sub-128-byte
metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
just hits the mutex very often.
But the writers that *do* take the alloc-under-mutex path are
real. Lock-wait data confirms it shows up:
perf lock contention -ab (30s, anon_pipe_* only):
contentions total wait max wait
arm64 (NVIDIA Grace)
anon_pipe_write 7660 4.39 ms 50 us
anon_pipe_read 7 55.78 us 17 us
x86_64 (Bergamo)
anon_pipe_write 3154 4.53 ms 79 us
anon_pipe_read 817 1.72 ms 82 us
The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
the symptom the patch targets: a multi-page writer holding
pipe->mutex across alloc_page() stalls the concurrent reader on
the same mutex.
bpftrace, mutex_lock duration inside anon_pipe_write, 300s:
Grace-A Grace-B Bergamo
[4K, 8K) 1090 2799 980
[8K, 16K) 1785 1353 953
[16K,32K) 755 623 530
[32K,64K) 153 172 184
[64K,128K) 13 8 59
[128K,256K) - - 4
[256K,512K) - - 1 <-- ~500us blocked
So the patch's value is shortening the long critical sections that
the 99% of small writers are blocked on -- the contention everyone
sees drops because the few multi-page writers do their alloc work
outside the lock.
>= 64KB, which are not common but existent.
> I'm asking because this would be better implemented taking into account
> how many pages are hanging out in the tmp_page array. With only one
> writer and one reader there is no concern someone will steal the pages.
>
> Preferably the tmp_page array would grow to accomodate more memory, but
> admittedly this runs into a problem where a pipe can linger unused
> indefinitely while hoarding several pages. As in, some shrinking
> mechanism would be needed so that's probably a no-go and preallocation
> is the way to go for now.
>
> Even then, this can definitely be integrated much better instead of
> being open-coded.
> This is some nasty golfing.
>
> Instead, you could have something like:
> struct tmp_page_prealloc {
> struct page *pages;
> unsigned int count;
> };
Agreed -- v2 will wrap the array+counter in
struct tmp_page_prealloc and drop the open-coded indexing.
> struct tmp_page_prealloc tpp = {};
> [...]
> This should move to a dedicated routine, perhaps:
> anon_get_page_prealloc(pipe, &tmp_page_prealloc);
>
> then it can also easily decide how much to prealloc based on existing
> state
Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
keep the policy in one place.
> > + while (prealloc_n)
> > + anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
>
> this results in put_page() calls under the mutex, still extending the hold
> time
>
> you could split this into 2 ops, for example:
> + anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
> if (pipe_is_full(pipe))
> wake_next_writer = false;
> mutex_unlock(&pipe->mutex);
> + anon_pipe_free_pages(pipe, &tmp_page_prealloc);
Well spotted!
> All that aside, I have no idea about performance impact on arm64, but
> there is *definitely* some throughput lost based on the fact that memory
> copy is restarted per page. I presume arm64 has an equivalent of amd64's
> SMAP which again will not be free and is paid for every time as well.
Can you say a bit more about what you'd want to see here? I want to make sure
I'm reading the suggestion the same way you are.
My understanding: anon_pipe_write loops once per page calling
copy_page_from_iter(), and each call re-enters the iov_iter
machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).
So an N-page write pays the user-access bracket N times instead of
once, on top of resetting the per-page memcpy and losing the
hardware prefetch streak.
The way I see it, is opening a single user_read_access_begin() region around
the page loop and use unsafe_copy_from_user() inside, with a labeled
fault-fallback path that closes the region and finishes via the slow path.
Is this what you meant?
Really thanks for your review,
--breno