Currently, pagecache writeback is performed by a single thread. Inodes
are added to a dirty list, and delayed writeback is triggered. The
single writeback thread then iterates through the dirty inode list and
executes the writeback.

This series parallelizes writeback by allowing multiple writeback
contexts per backing device (bdi). These writeback contexts are run as
separate, independent threads, improving overall parallelism. Inodes
are distributed across these threads and flushed in parallel.

This patchset applies cleanly on the v6.17 kernel.

Design Overview
================
Following Jan Kara's suggestion [1], we have introduced a new bdi
writeback context within the backing_dev_info structure. Specifically,
we have created a new structure, bdi_writeback_ctx, which carries its
own set of writeback members for each context:

struct bdi_writeback_ctx {
	struct bdi_writeback wb;
	struct list_head wb_list;	/* list of all wbs */
	struct radix_tree_root cgwb_tree;
	struct rw_semaphore wb_switch_rwsem;
	wait_queue_head_t wb_waitq;
};

There can be multiple writeback contexts in a bdi, which helps in
achieving writeback parallelism:

struct backing_dev_info {
	...
	int nr_wb_ctx;
	struct bdi_writeback_ctx **wb_ctx;
	...
};

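Paths that previously operated on the bdi's single embedded writeback
state (sync, flusher wakeup, stats) now need to walk every context. A
minimal sketch of such a walk, assuming the fields above (the helper
name is ours, not the series'):

static void wakeup_all_wb_ctx(struct backing_dev_info *bdi)
{
	int i;

	/* Kick the delayed-writeback worker of every context. */
	for (i = 0; i < bdi->nr_wb_ctx; i++)
		wb_wakeup_delayed(&bdi->wb_ctx[i]->wb);
}
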
FS geometry and filesystem fragmentation
========================================
The community was concerned that parallelizing writeback would impact
delayed allocation and increase filesystem fragmentation.

Our analysis of XFS delayed allocation behavior showed that merging of
extents occurs within a specific inode. Earlier experiments with
multiple writeback contexts [2] resulted in increased fragmentation due
to the same inode being processed by different threads.

To mitigate this issue, we ensure that an inode is always associated
with a specific writeback context, allowing delayed allocation to
function effectively.
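For illustration, the affinity can be as simple as a deterministic map
from inode number to writeback context; the helper below is our sketch
under that assumption, not the exact code from the series:

static inline struct bdi_writeback_ctx *
inode_to_wb_ctx(struct backing_dev_info *bdi, struct inode *inode)
{
	/* Same inode, same context: each inode is only ever flushed
	 * by one writeback thread, so delayed allocation keeps
	 * merging extents as before. */
	return bdi->wb_ctx[inode->i_ino % bdi->nr_wb_ctx];
}
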
Number of writeback contexts
============================
We've implemented two interfaces to manage the number of writeback
contexts:

1) Sysfs interface: As suggested by Christoph, we've added a sysfs
   interface that lets users adjust the number of writeback contexts
   dynamically (a sketch follows this list).

2) Filesystem superblock interface: We've also introduced a superblock
   interface through which a filesystem reports its preferred number
   of writeback contexts. For XFS, this count is set equal to the
   allocation group (AG) count. When mounting a filesystem, we
   automatically increase the number of writeback threads to match
   this count.
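Hedged sketches of both interfaces follow; the names nr_wb_ctx_store,
bdi_resize_wb_ctx and the superblock hook are our assumptions for
illustration, not the series' final API. The sysfs store handler might
parse the requested count and bound it by the number of online CPUs,
as the series does:

static ssize_t nr_wb_ctx_store(struct device *dev,
			       struct device_attribute *attr,
			       const char *buf, size_t count)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);
	unsigned int nr;
	int err;

	err = kstrtouint(buf, 10, &nr);
	if (err)
		return err;
	if (nr > num_online_cpus())
		return -EINVAL;

	err = bdi_resize_wb_ctx(bdi, nr);	/* assumed helper */
	return err ? err : count;
}

On the filesystem side, the superblock hook could simply report the AG
count for XFS (XFS_M is the existing mount macro):

static int xfs_get_nr_wb_ctx(struct super_block *sb)
{
	return XFS_M(sb)->m_sb.sb_agcount;
}
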
Resolving the Issue with Multiple Writebacks
============================================
For XFS, affining inodes to writeback threads resulted in a decline
in IOPS on certain devices. The issue was caused by AG lock contention
in xfs_end_io, where multiple writeback threads competed for the same
AG lock.

To address this, we now affine writeback threads to the allocation
group. In the best case, allocation happens from the same AG where the
inode's metadata resides, so the threads no longer compete for the
same AG lock.
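As an illustration of the AG affinity (XFS_INO_TO_AGNO is the real XFS
macro for deriving an inode's AG; the helper itself is our sketch):

static struct bdi_writeback_ctx *
xfs_inode_to_wb_ctx(struct xfs_mount *mp, xfs_ino_t ino)
{
	/* One flusher per AG: end-I/O work for a given AG runs on a
	 * single thread instead of contending on the AG lock. */
	xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, ino);
	struct backing_dev_info *bdi = mp->m_super->s_bdi;

	return bdi->wb_ctx[agno % bdi->nr_wb_ctx];
}
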
A similar IOPS decline was observed with other filesystems under
different workloads. To avoid such regressions, we have decided to
limit parallelism to XFS for now. Other filesystems can opt in to
parallelism and distribute inodes according to their own geometry.

IOPS and throughput
===================
With affinity to the allocation group, we see a significant improvement
in XFS when we write to multiple files in different directories (AGs).

Performance gains:

A) Workload: 12 files of 1G each in 12 directories (AGs), numjobs = 12
   - NVMe device BM1743 SSD
     Base XFS               :  243 MiB/s
     Parallel Writeback XFS :  759 MiB/s (+212%)
   - NVMe device PM9A3 SSD
     Base XFS               :  368 MiB/s
     Parallel Writeback XFS : 1634 MiB/s (+344%)

B) Workload: 6 files of 20G each in 6 directories (AGs), numjobs = 6
   - NVMe device BM1743 SSD
     Base XFS               : 305 MiB/s
     Parallel Writeback XFS : 706 MiB/s (+131%)
   - NVMe device PM9A3 SSD
     Base XFS               : 315 MiB/s
     Parallel Writeback XFS : 990 MiB/s (+214%)

Filesystem fragmentation
========================
We also see that there is no increase in filesystem fragmentation.

Number of extents per file:

A) Workload: 6 files of 1G each in a single directory (AG), numjobs = 1
   Base XFS               : 17
   Parallel Writeback XFS : 17

B) Workload: 12 files of 1G each in 12 directories (AGs), numjobs = 12
   Base XFS               : 166593
   Parallel Writeback XFS : 161554

C) Workload: 6 files of 20G each in 6 directories (AGs), numjobs = 6
   Base XFS               : 3173716
   Parallel Writeback XFS : 3364984

Testing using kdevops
=====================
1. fstests passed for all XFS profiles.
2. fstests also passed for EXT4 and BTRFS; these were run as sanity
   checks.

Changes since v1:
- Enabled parallel writeback for XFS only, for optimal performance
- Made writeback threads affine to allocation groups
- Increased the number of writeback threads to the AG count at mount
- Added a sysfs entry to change the number of writebacks for a
  bdi (Christoph)
- Added a filesystem interface to fetch 64-bit inode numbers (Christoph)
- Made common helpers to contain writeback-specific changes that were
  affecting f2fs, fuse, gfs2 and nfs (Christoph)
- Renamed wb_ctx_arr to wb_ctx (Andrew Morton)

Kundan Kumar (16):
  writeback: add infra for parallel writeback
  writeback: add support to initialize and free multiple writeback ctxs
  writeback: link bdi_writeback to its corresponding bdi_writeback_ctx
  writeback: affine inode to a writeback ctx within a bdi
  writeback: modify bdi_writeback search logic to search across all wb
    ctxs
  writeback: invoke all writeback contexts for flusher and dirtytime
    writeback
  writeback: modify sync related functions to iterate over all writeback
    contexts
  writeback: add support to collect stats for all writeback ctxs
  f2fs: add support in f2fs to handle multiple writeback contexts
  fuse: add support for multiple writeback contexts in fuse
  gfs2: add support in gfs2 to handle multiple writeback contexts
  nfs: add support in nfs to handle multiple writeback contexts
  writeback: configure the num of writeback contexts between 0 and
    number of online cpus
  writeback: segregated allocation and free of writeback contexts
  writeback: added support to change the number of writebacks using a
    sysfs attribute
  writeback: added XFS support for matching writeback count to
    allocation group count

 fs/f2fs/node.c                   |   4 +-
 fs/f2fs/segment.h                |   2 +-
 fs/fs-writeback.c                | 148 +++++++----
 fs/fuse/file.c                   |   7 +-
 fs/gfs2/super.c                  |   2 +-
 fs/nfs/internal.h                |   2 +-
 fs/nfs/write.c                   |   4 +-
 fs/super.c                       |  23 ++
 fs/xfs/xfs_super.c               |  15 ++
 include/linux/backing-dev-defs.h |  32 ++-
 include/linux/backing-dev.h      |  79 +++++-
 include/linux/fs.h               |   3 +-
 mm/backing-dev.c                 | 412 +++++++++++++++++++++++++------
 mm/page-writeback.c              |  13 +-
 14 files changed, 581 insertions(+), 165 deletions(-)
--
2.25.1