On Wed, May 20, 2026 at 7:09 PM Breno Leitao <[email protected]> wrote:
> On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> > This change is quite a hammer.
>
> Are you seeing any downside/trade-off with this current approach?
>
Well there is a lot to say about the current code to begin with, I'm
going to refrain from writing long rants here as that's of no use in
this thread.
The idea behind the patch has a lot of merit and I'm definitely not
protesting it or denying there is a problem.
> >
> > Do you have a real workload where there are multiple processes issuing
> > large ops to the same pipe on both reader and writer side?
> >
> > The only workload I'm aware of sharing the pipe between more than just
> > one reader and one writer is anything with make, but that is mostly
> > operating on 1-byte ops.
>
> You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
> anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):
>
> size writes %
> <128 B 293602 98.0 (merge path, no alloc)
> 128 B - 4 KB 1108 0.4
> 4 - 8 KB 2149 0.72 <-- patch start helps from here
> 8 - 16 KB 3
> 16 - 32 KB 0
> 32 - 64 KB 5
> 64 - 128 KB 2 <-- up to 72KB writes
>
> So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
> contention count is a Meta fleet workload spraying sub-128-byte
> metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
> just hits the mutex very often.
>
> But the writers that *do* take the alloc-under-mutex path are
> real. Lock-wait data confirms it shows up:
>
> perf lock contention -ab (30s, anon_pipe_* only):
>
> contentions total wait max wait
> arm64 (NVIDIA Grace)
> anon_pipe_write 7660 4.39 ms 50 us
> anon_pipe_read 7 55.78 us 17 us
> x86_64 (Bergamo)
> anon_pipe_write 3154 4.53 ms 79 us
> anon_pipe_read 817 1.72 ms 82 us
>
> The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
> the symptom the patch targets: a multi-page writer holding
> pipe->mutex across alloc_page() stalls the concurrent reader on
> the same mutex.
>
> bpftrace, mutex_lock duration inside anon_pipe_write, 300s:
>
> Grace-A Grace-B Bergamo
> [4K, 8K) 1090 2799 980
> [8K, 16K) 1785 1353 953
> [16K,32K) 755 623 530
> [32K,64K) 153 172 184
> [64K,128K) 13 8 59
> [128K,256K) - - 4
> [256K,512K) - - 1 <-- ~500us blocked
>
> So the patch's value is shortening the long critical sections that
> the 99% of small writers are blocked on -- the contention everyone
> sees drops because the few multi-page writers do their alloc work
> outside the lock.
>
> >= 64KB, which are not common but existent.
>
Ye this makes sense to me.
> > This should move to a dedicated routine, perhaps:
> > anon_get_page_prealloc(pipe, &tmp_page_prealloc);
> >
> > then it can also easily decide how much to prealloc based on existing
> > state
>
> Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
> top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
> keep the policy in one place.
>
Note I lost the 'pipe' word. This definitely needs to be anon_pipe_-prefixed.
> > All that aside, I have no idea about performance impact on arm64, but
> > there is *definitely* some throughput lost based on the fact that memory
> > copy is restarted per page. I presume arm64 has an equivalent of amd64's
> > SMAP which again will not be free and is paid for every time as well.
>
> Can you say a bit more about what you'd want to see here? I want to make sure
> I'm reading the suggestion the same way you are.
>
> My understanding: anon_pipe_write loops once per page calling
> copy_page_from_iter(), and each call re-enters the iov_iter
> machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).
I was told arm64 has an explicit instruction set to deal with user
memory, so it might be this is implemented differently there and
fixing the problem might happen to not alter perf on that arch. It
will definitely help on amd64.
> So an N-page write pays the user-access bracket N times instead of
> once, on top of resetting the per-page memcpy and losing the
> hardware prefetch streak.
>
This bit is unavoidable with the current implementation as backing
pages are not guaranteed to be even virtually contiguous.
The extra SMAP trips however can be avoided.
> The way I see it, is opening a single user_read_access_begin() region around
> the page loop and use unsafe_copy_from_user() inside, with a labeled
> fault-fallback path that closes the region and finishes via the slow path.
>
> Is this what you meant?
>
Yes indeed. This is however optional and even if implemented it should
be a separate patch.
As a nit you could sprinkle some predicts in a separate commit, for
example "if (!pipe->readers) {" definitely warrants an unlikely().
The real deal is the following: the wakeup policy is to *always*
wakeup *all* readers if any data is available and *all* writers if any
space is available. In case of heavily shared usage this leads to the
thundering herd problem (and it is causing performance issues with the
BSD and GNU makes). While this cannot be changed by default, I was
thinking about an opt-in fcntl which changes to policy to only wake up
one waiter.
In case of a microbenchmark like the one your provided here such a
policy might end up reducing throughput as there will be more time
spent off cpu by the benchmark. However, in case of real workloads
which presumably have stuff to do, it should be a win.
That is to say, if you have 1 reader and 1 writer, this is irrelevant.
But if you have multiple and they block on the pipe, there might be a
win hiding here for your actual workload by utilizing the above.
Something to consider.