On Sun 26-04-26 07:56:08, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context. Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
>
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background. This moves writeback submission
> completely off the writer's hot path.
>
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the new NR_DONTCACHE_DIRTY counter to determine how many pages to write
> back. The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
>
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.
>
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility, and target the correct cgroup writeback domain via
> unlocked_inode_to_wb_begin().
>
> dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> ~503 GB, compared to a v6.19-ish baseline):
>
> Single-client sequential write (MB/s):
>                       baseline    patched     change
>   buffered            1449.8      1440.1      -0.7%
>   dontcache           1347.9      1461.5      +8.4%
>   direct              1450.0      1440.1      -0.7%
>
> Single-client sequential write latency (us):
>                       baseline    patched     change
>   dontcache p50       3031.0      10551.3     +248.1%
>   dontcache p99       74973.2     21626.9     -71.2%
>   dontcache p99.9     85459.0     23199.7     -72.9%
>
> Single-client random write (MB/s):
>                       baseline    patched     change
>   dontcache           284.2       295.4       +3.9%
>
> Single-client random write p99.9 latency (us):
>                       baseline    patched     change
>   dontcache           2277.4      872.4       -61.7%
>
> Multi-writer aggregate throughput (MB/s):
>                       baseline    patched     change
>   buffered            1619.5      1611.2      -0.5%
>   dontcache           1281.1      1629.4      +27.2%
>   direct              1545.4      1609.4      +4.1%
>
> Mixed-mode noisy neighbor (dontcache writer + buffered readers):
>                       baseline    patched     change
>   writer (MB/s)       1297.6      1471.1      +13.4%
>   readers avg (MB/s)  855.0       462.4       -45.9%
>
> nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> file size ~502 GB, compared to v6.19-ish baseline):
>
> Single-client sequential write (MB/s):
>                       baseline    patched     change
>   buffered            4844.2      4653.4      -3.9%
>   dontcache           3028.3      3723.1      +22.9%
>   direct              957.6       987.8       +3.2%
>
> Single-client sequential write p99.9 latency (us):
>                       baseline    patched     change
>   dontcache           759169.0    175112.2    -76.9%
>
> Single-client random write (MB/s):
>                       baseline    patched     change
>   dontcache           590.0       1561.0      +164.6%
>
> Multi-writer aggregate throughput (MB/s):
>                       baseline    patched     change
>   buffered            9636.3      9422.9      -2.2%
>   dontcache           1894.9      9442.6      +398.3%
>   direct              809.6       975.1       +20.4%
>
> Noisy neighbor (dontcache writer + random readers):
>                       baseline    patched     change
>   writer (MB/s)       1854.5      4063.6      +119.1%
>   readers avg (MB/s)  131.2       101.6       -22.5%
>
> The NFS results show even larger improvements than the local benchmarks.
> Multi-writer dontcache throughput improves nearly 5x, matching buffered
> I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> buffered.
>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <[email protected]>

One comment regarding how the writeback is started:

> +static long wb_check_start_dontcache(struct bdi_writeback *wb)
> +{
> +	long nr_pages;
> +
> +	if (!test_bit(WB_start_dontcache, &wb->state))
> +		return 0;
> +
> +	nr_pages = global_node_page_state(NR_DONTCACHE_DIRTY);
> +	if (nr_pages) {
> +		struct wb_writeback_work work = {
> +			.nr_pages	= wb_split_bdi_pages(wb, nr_pages),
> +			.sync_mode	= WB_SYNC_NONE,
> +			.range_cyclic	= 1,
> +			.reason		= WB_REASON_DONTCACHE,
> +		};
> +
> +		nr_pages = wb_writeback(wb, &work);
> +	}
> +
> +	clear_bit(WB_start_dontcache, &wb->state);
> +	return nr_pages;
> +}

So this will end up splitting NR_DONTCACHE_DIRTY folios among per-cgroup wb
structures based on their writeback bandwidth. This is a reasonable thing
for global writeback, where the bandwidth more or less corresponds to the
amount of dirty folios. However, with DONTCACHE I expect big differences in
NR_DONTCACHE_DIRTY among different cgroups, not necessarily corresponding
to wb throughput. In particular, if you do DONTCACHE writes from one cgroup
and normal writes from another cgroup, this will systematically
underestimate the amount that needs to be written by a factor of about two.
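
To make the factor of two concrete, here is a rough userspace model of the
arithmetic. It assumes wb_split_bdi_pages() scales the page count by the
wb's share of the total bdi write bandwidth, rounding up. The names
split_pages(), struct wb_model and target_from_global() are made up for
this sketch and are not kernel API.

```c
#include <stdint.h>

/*
 * Model of a bandwidth-proportional split: the wb gets
 * nr_pages * this_bw / tot_bw, rounded up. If the total is unknown or
 * this wb accounts for all of it, the full count is returned.
 */
static long split_pages(long nr_pages, unsigned long this_bw,
			unsigned long tot_bw)
{
	if (!tot_bw || this_bw >= tot_bw)
		return nr_pages;
	return (long)(((uint64_t)nr_pages * this_bw + tot_bw - 1) / tot_bw);
}

/* Hypothetical per-cgroup writeback state, for illustration only. */
struct wb_model {
	unsigned long bw;	/* average write bandwidth of this wb */
	long dontcache_dirty;	/* DONTCACHE pages dirtied through this wb */
};

/*
 * Pages a wb is asked to write when the DONTCACHE count is global:
 * sum the counter over all wbs, then split it by bandwidth share.
 */
static long target_from_global(const struct wb_model *wb,
			       const struct wb_model *all, int n)
{
	unsigned long tot_bw = 0;
	long global_dirty = 0;
	int i;

	for (i = 0; i < n; i++) {
		tot_bw += all[i].bw;
		global_dirty += all[i].dontcache_dirty;
	}
	return split_pages(global_dirty, wb->bw, tot_bw);
}
```

With two wbs of equal bandwidth, where one has dirtied 1000 DONTCACHE pages
and the other only does normal buffered writes, target_from_global() asks
each of them for 500 pages. The DONTCACHE wb is asked for half of what it
actually dirtied, and the other wb is asked to expedite 500 pages that have
nothing to do with DONTCACHE.
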
So I think the stat should be a per-bdi_writeback one (instead of a per-node
one), which would avoid the need to split the value among wbs here.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR