Hi Samuel,
Apologies for the long radio silence on this. I finally circled back to
this patch with a more robust benchmarking environment to get to the bottom
of the "deadlock vs. performance" question.
After extensive testing, my conclusion is that we should drop this patch.
Here is what I saw:
The Deadlock is Real: My original "deadlock" suspicion was correct. If
we rely on the helper to skip gaps, we trigger the buffering path
(get_page_buf), which allocates memory. Doing this inside the pager's write
path (which is called during memory pressure) causes a classic OOM deadlock.
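To make the hazard concrete, here is the shape of the problem as a
minimal standalone sketch (get_page_buf is the real helper name; the
rest of the scaffolding is mine and purely illustrative):

    /* Illustrative sketch of the deadlock structure, not real pager code.  */
    #include <stdlib.h>
    #include <stddef.h>

    /* Stand-in for ext2fs's get_page_buf: hands out a bounce buffer and
       may allocate fresh memory when the free pool is empty.  */
    static void *
    get_page_buf (void)
    {
      return malloc (4096);
    }

    /* Stand-in for the pager's write path: the kernel invokes this
       precisely because memory is tight and it wants pages cleaned.  */
    static void
    pageout_blocks (const size_t *blocks, size_t n)
    {
      for (size_t i = 1; i < n; i++)
        if (blocks[i] != blocks[i - 1] + 1)
          {
            /* Bridging the gap requires a buffer.  Allocating here can
               block until memory is freed, but memory is freed by
               completing this very pageout.  Hence the deadlock.  */
            void *buf = get_page_buf ();
            /* ... copy the run into buf and batch it ... */
            free (buf);
          }
    }

    int
    main (void)
    {
      size_t blocks[] = { 3, 4, 9 };   /* non-consecutive: gap after 4 */
      pageout_blocks (blocks, 3);
      return 0;
    }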
The "Safe" Version is Slower: I rewrote the loop to be strictly
zero-copy (flushing and resetting on every gap to avoid allocation). While
this fixed the deadlock, it introduced a ~10-15% performance regression on
metadata-heavy workloads (e.g., tar -xf linux kernel tree) and a very
slugigsh UI feel while this was happening!
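For reference, the rewritten loop was roughly the following (a
simplified sketch; flush () stands in for the pending_blocks_write /
store_write machinery):

    /* Sketch of the flush-and-reset loop, assuming sorted block numbers.  */
    #include <stdio.h>
    #include <stddef.h>

    /* Stand-in for one store_write of a contiguous run.  */
    static void
    flush (size_t start, size_t len)
    {
      printf ("store_write: blocks %zu..%zu\n", start, start + len - 1);
    }

    static void
    write_blocks (const size_t *blocks, size_t n)
    {
      size_t start = 0, len = 0;
      for (size_t i = 0; i < n; i++)
        {
          if (len > 0 && blocks[i] != blocks[i - 1] + 1)
            {
              /* Gap: flush the run so far and start over, instead of
                 allocating a bounce buffer to bridge the hole.  */
              flush (start, len);
              len = 0;
            }
          if (len == 0)
            start = blocks[i];
          len++;
        }
      if (len > 0)
        flush (start, len);
    }

    int
    main (void)
    {
      size_t blocks[] = { 10, 11, 12, 40, 41, 97 };
      write_blocks (blocks, sizeof blocks / sizeof blocks[0]);
      return 0;
    }

There is no allocation anywhere on the write path, which is why the
deadlock goes away, but every gap now costs a separate store_write,
which is exactly what bites on scattered metadata.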
The Root Cause: I believe we are inadvertently serializing IO. The
standard pageout path allows multiple threads to fire requests to the disk
driver in parallel. By aggressively trying to batch them in a single
thread, we lose that concurrency, and the "fragmented" nature of metadata
means we rarely get large enough batches to make up for the loss.
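A toy model of that effect (nothing below is Hurd code; it just
simulates N fixed-latency requests fired from parallel threads vs
issued back to back by a single thread):

    /* Toy simulation of parallel vs serialized request submission.  */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define NREQ 8
    #define LATENCY_US 100000        /* pretend each request takes 100 ms */

    static void *
    fire_request (void *arg)
    {
      (void) arg;
      usleep (LATENCY_US);           /* stand-in for the disk driver */
      return NULL;
    }

    static double
    now (void)
    {
      struct timespec ts;
      clock_gettime (CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int
    main (void)
    {
      pthread_t tid[NREQ];
      double t0 = now ();

      /* Standard pageout: many threads, requests overlap at the driver.
         Wall time is roughly one latency (~0.1 s here).  */
      for (int i = 0; i < NREQ; i++)
        pthread_create (&tid[i], NULL, fire_request, NULL);
      for (int i = 0; i < NREQ; i++)
        pthread_join (tid[i], NULL);
      printf ("parallel:   %.2f s\n", now () - t0);

      /* One batching thread with no mergeable runs: fully serialized.
         Wall time is roughly NREQ latencies (~0.8 s here).  */
      t0 = now ();
      for (int i = 0; i < NREQ; i++)
        fire_request (NULL);
      printf ("serialized: %.2f s\n", now () - t0);
      return 0;
    }

In this toy setup that is an 8x difference; a real disk reorders and
coalesces requests, but the direction of the effect is the same.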
The bulk strategy worked well for FILE_DATA (which is often contiguous),
but DISK metadata is just too scattered to benefit from this approach.
Thanks for the guidance on the reviews!
Best, Milos
On Mon, Sep 8, 2025 at 12:33 PM Samuel Thibault <[email protected]>
wrote:
> Hello,
>
> Milos Nikic, on Mon. 08 Sep 2025 09:41:05 -0700, wrote:
> > I have actually been experimenting with this, and for some reason we
> > deadlock if we don't stop at non-consecutive blocks.
>
> ? What do you mean by "stop"? In your initial patch you would not
> completely stop, as in you would just call pending_blocks_write and
> continue piling blocks in pb. If you let pending_blocks_add do the
> pending_blocks_write call for you, there shouldn't be any difference?
>
> > I also did a bunch of testing, like creating many directories and
> > sparse files, hoping that I could get some big savings on the number
> > of times store_write was called, but I have to say I didn't.
> >
> > So either my tests are not great, or there isn't much benefit to gain
> > here by batching these writes for the disk case.
>
> Did you make some statistics on the occurrence of consecutive blocks? I
> would not be that surprised if the numbers are not very high, since the
> metadata are not very large. It'll still be useful to commit something
> working.
>
> Samuel
>