Currently, pagecache writeback is performed by a single thread. Inodes
are added to a dirty list and delayed writeback is triggered; the
single writeback thread then iterates through the dirty inode list and
writes the inodes back.
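
For reference, the existing single-threaded path looks roughly like
this (an illustrative, simplified call chain through fs/fs-writeback.c):

__mark_inode_dirty(inode)        /* queue inode on the wb dirty list */
  -> wb_wakeup_delayed(wb)       /* arm the delayed writeback work */

wb_workfn()                      /* the single flusher worker */
  -> wb_do_writeback()
     -> wb_writeback()
        -> writeback_sb_inodes() /* walk dirty inodes, write back */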

This series parallelizes the writeback by allowing multiple writeback
contexts per backing device (bdi). These writeback contexts are executed
as separate, independent threads, improving overall parallelism. Inodes
are distributed to these threads and are flushed in parallel.

This patchset applies cleanly on the v6.17 kernel.

Design Overview
================
Following Jan Kara's suggestion [1], we have introduced a new bdi
writeback context within the backing_dev_info structure. Specifically,
we have created a new structure, bdi_writeback_ctx, which carries its
own set of writeback members for each context.

struct bdi_writeback_ctx {
        struct bdi_writeback wb;
        struct list_head wb_list; /* list of all wbs */
        struct radix_tree_root cgwb_tree;
        struct rw_semaphore wb_switch_rwsem;
        wait_queue_head_t wb_waitq;
};

A bdi can contain multiple writeback contexts, each run as its own
thread, which is what provides the writeback parallelism.

struct backing_dev_info {
...
        int nr_wb_ctx;
        struct bdi_writeback_ctx **wb_ctx;
...
};
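
Code paths that previously used the bdi's single writeback structure
(flusher wakeup, sync, stats) now have to walk every context. A minimal
sketch of such an iteration helper; the macro name here is illustrative,
not necessarily the one used in the series:

#define for_each_bdi_wb_ctx(bdi, ctx, i)                              \
        for ((i) = 0;                                                 \
             (i) < (bdi)->nr_wb_ctx && ((ctx) = (bdi)->wb_ctx[(i)]);  \
             (i)++)

Used, for example, to kick the flusher of every context:

        struct bdi_writeback_ctx *wb_ctx;
        int i;

        for_each_bdi_wb_ctx(bdi, wb_ctx, i)
                wb_wakeup(&wb_ctx->wb); /* wake this context's flusher */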

FS geometry and filesystem fragmentation
========================================
The community was concerned that parallelizing writeback would impact
delayed allocation and increase filesystem fragmentation.
Our analysis of XFS delayed allocation behavior showed that extent
merging happens only within a single inode. Earlier experiments with
multiple writeback contexts [2] resulted in increased fragmentation
because the same inode could be processed by different threads.

To mitigate this issue, we ensure that an inode is always associated
with a specific writeback context, allowing delayed allocation to
function effectively.
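
A minimal sketch of such an affinity rule, assuming a plain modulo of
the inode number over the context count (the helper name is
hypothetical; the exact mapping in the series may differ):

static inline struct bdi_writeback_ctx *
inode_to_wb_ctx(struct backing_dev_info *bdi, struct inode *inode)
{
        /*
         * The same inode always maps to the same context, so all of
         * its delalloc extents are flushed by a single thread.
         */
        return bdi->wb_ctx[inode->i_ino % bdi->nr_wb_ctx];
}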

Number of writeback contexts
============================
We've implemented two interfaces to manage the number of writeback
contexts:
1) Sysfs Interface: As suggested by Christoph, we've added a sysfs
   interface to allow users to adjust the number of writeback contexts
   dynamically.
2) Filesystem Superblock Interface: We've also introduced a filesystem
   superblock interface that returns the filesystem-specific number of
   writeback contexts. For XFS, this count is set equal to the
   allocation group count. When mounting a filesystem, we automatically
   increase the number of writeback threads to match this count (see
   the sketch after this list).
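
A sketch of what the superblock hook could look like for XFS; the
->get_nr_wb_ctx() operation name and its wiring are our illustration,
while XFS_M() and sb_agcount are the existing XFS accessors. A
filesystem that does not implement the hook keeps a single context:

/* hypothetical super_operations hook */
static int xfs_get_nr_wb_ctx(struct super_block *sb)
{
        struct xfs_mount *mp = XFS_M(sb);

        /* one writeback context per allocation group */
        return mp->m_sb.sb_agcount;
}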

Resolving the Issue with Multiple Writebacks
============================================
For XFS, affining inodes to writeback threads resulted in a decline
in IOPS on certain devices. The issue was caused by AG lock contention
in xfs_end_io(), where multiple writeback threads competed for the
same AG lock.

To address this, we now affine writeback threads to allocation groups,
which resolves the contention. In the best case, allocation happens
from the same AG where the inode metadata resides, avoiding lock
contention.
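
With AG affinity, the context index is derived from the inode's
allocation group rather than from a plain hash of the inode number. A
sketch using XFS's existing XFS_INO_TO_AGNO() macro inside an
illustrative helper:

static inline unsigned int
xfs_ino_to_wb_ctx_idx(struct xfs_mount *mp, xfs_ino_t ino, int nr_wb_ctx)
{
        /*
         * Inodes in the same AG map to the same writeback context, so
         * each AG lock is only ever taken by one flusher thread.
         */
        return XFS_INO_TO_AGNO(mp, ino) % nr_wb_ctx;
}

When the number of contexts matches the AG count (as set up at mount),
this mapping becomes one context per AG.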

A similar IOPS decline was observed with other filesystems under
different workloads. To avoid such regressions, we have limited
parallelism to XFS for now. Other filesystems can introduce parallelism
and distribute inodes according to their own geometry.

IOPS and throughput
===================
With allocation group affinity we see a significant improvement on
XFS when writing to multiple files in different directories (AGs).

Performance gains:
  A) Workload 12 files each of 1G in 12 directories (AGs) - numjobs = 12
    - NVMe device BM1743 SSD
        Base XFS                : 243 MiB/s
        Parallel Writeback XFS  : 759 MiB/s  (+212%)

    - NVMe device PM9A3 SSD
        Base XFS                : 368 MiB/s
        Parallel Writeback XFS  : 1634 MiB/s  (+344%)

  B) Workload 6 files each of 20G in 6 directories (AGs) - numjobs = 6
    - NVMe device BM1743 SSD
        Base XFS                : 305 MiB/s
        Parallel Writeback XFS  : 706 MiB/s  (+131%)

    - NVMe device PM9A3 SSD
        Base XFS                : 315 MiB/s
        Parallel Writeback XFS  : 990 MiB/s  (+214%)

Filesystem fragmentation
========================
We also see that there is no increase in filesystem fragmentation.
Number of extents per file:
  A) Workload 6 files each of 1G in a single directory (AG) - numjobs = 1
        Base XFS                : 17
        Parallel Writeback XFS  : 17

  B) Workload 12 files each of 1G to 12 directories (AGs) - numjobs = 12
        Base XFS                : 166593
        Parallel Writeback XFS  : 161554

  C) Workload 6 files each of 20G to 6 directories (AGs) - numjobs = 6
        Base XFS                : 3173716
        Parallel Writeback XFS  : 3364984

Testing using kdevops
=====================
1. fstests passed for all XFS profiles.
2. fstests also passed for EXT4 and BTRFS; these were run for sanity.

Changes since v1:
 - Enabled parallel writeback for XFS only, for optimal performance
 - Affined writeback threads to allocation groups
 - Increased the number of writeback threads to the AG count at mount
 - Added a sysfs entry to change the number of writebacks for a
   bdi (Christoph)
 - Added a filesystem interface to fetch 64-bit inode numbers (Christoph)
 - Made common helpers to contain the writeback-specific changes that
   were affecting f2fs, fuse, gfs2 and nfs (Christoph)
 - Renamed wb_ctx_arr to wb_ctx (Andrew Morton)

Kundan Kumar (16):
  writeback: add infra for parallel writeback
  writeback: add support to initialize and free multiple writeback ctxs
  writeback: link bdi_writeback to its corresponding bdi_writeback_ctx
  writeback: affine inode to a writeback ctx within a bdi
  writeback: modify bdi_writeback search logic to search across all wb
    ctxs
  writeback: invoke all writeback contexts for flusher and dirtytime
    writeback
  writeback: modify sync related functions to iterate over all writeback
    contexts
  writeback: add support to collect stats for all writeback ctxs
  f2fs: add support in f2fs to handle multiple writeback contexts
  fuse: add support for multiple writeback contexts in fuse
  gfs2: add support in gfs2 to handle multiple writeback contexts
  nfs: add support in nfs to handle multiple writeback contexts
  writeback: configure the num of writeback contexts between 0 and
    number of online cpus
  writeback: segregated allocation and free of writeback contexts
  writeback: added support to change the number of writebacks using a
    sysfs attribute
  writeback: added XFS support for matching writeback count to
    allocation group count

 fs/f2fs/node.c                   |   4 +-
 fs/f2fs/segment.h                |   2 +-
 fs/fs-writeback.c                | 148 +++++++----
 fs/fuse/file.c                   |   7 +-
 fs/gfs2/super.c                  |   2 +-
 fs/nfs/internal.h                |   2 +-
 fs/nfs/write.c                   |   4 +-
 fs/super.c                       |  23 ++
 fs/xfs/xfs_super.c               |  15 ++
 include/linux/backing-dev-defs.h |  32 ++-
 include/linux/backing-dev.h      |  79 +++++-
 include/linux/fs.h               |   3 +-
 mm/backing-dev.c                 | 412 +++++++++++++++++++++++++------
 mm/page-writeback.c              |  13 +-
 14 files changed, 581 insertions(+), 165 deletions(-)

-- 
2.25.1


