[PATCH] docs: bcachefs: idle work scheduling design doc

Kent Overstreet Wed, 16 Apr 2025 08:06:34 -0700

People have been asking to see the plan for this, so -

bcachefs has various background tasks that need to be scheduled to
balance efficiency, predictability of performance, etc.


The design and philosophy haven't changed too much since bcache, which
was primarily designed for server usage, with sustained load in mind.

These days we're seeing more desktop usage, where we really want to let
the system idle effectively to reduce total power usage. The previous
concerns still apply, though: we still want to let work accumulate to a
degree.

This lays out all the requirements and starts to sketch out the
algorithm I have in mind.

Signed-off-by: Kent Overstreet <kent.overstr...@linux.dev>
---
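Not part of the patch, just to make the sustained load regime concrete: a
minimal sketch of a P/D controller for one background task. Everything here
(struct bg_pd, bg_pd_update(), the gains and their 1/256 scaling) is
hypothetical and only illustrates the shape of the idea; it is not taken
from existing bcachefs code.

```c
#include <stdint.h>

/*
 * Hypothetical P/D controller state for one background task.
 * "level" is the amount of pending work (e.g. fragmented disk space);
 * for a resource we consume, such as free journal space, the caller
 * would pass (capacity - free) instead.
 */
struct bg_pd {
	int64_t	target;		/* level we want to hover around */
	int64_t	kp;		/* proportional gain, in 1/256 units */
	int64_t	kd;		/* derivative gain, in 1/256 units */
	int64_t	prev_err;	/* error seen at the previous sample */
};

/* Returns the amount of work to issue for this sample, never negative. */
static int64_t bg_pd_update(struct bg_pd *pd, int64_t level)
{
	int64_t err  = level - pd->target;	/* > 0: more pending than we want */
	int64_t derr = err - pd->prev_err;	/* > 0: backlog is growing */
	int64_t rate = (pd->kp * err + pd->kd * derr) / 256;

	pd->prev_err = err;
	return rate > 0 ? rate : 0;
}
```

The derivative term is what makes the task react early to a load spike: a
backlog that is still growing gets more work issued than a backlog of the
same size that is holding steady.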
 .../filesystems/bcachefs/future/idle_work.rst | 77 ++++++++++++++++++
 Documentation/filesystems/bcachefs/index.rst  | 31 -------
 fs/bcachefs/idle.h                            | 80 +++++++++++++++++++
 3 files changed, 157 insertions(+), 31 deletions(-)
 create mode 100644 Documentation/filesystems/bcachefs/future/idle_work.rst
 delete mode 100644 Documentation/filesystems/bcachefs/index.rst
 create mode 100644 fs/bcachefs/idle.h

diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
new file mode 100644
index 000000000000..e2afe45e9dc7
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/future/idle_work.rst
@@ -0,0 +1,77 @@
+Idle/background work classes design doc:
+
+Right now, our behaviour at idle isn't ideal: it was designed for servers that
+would be under sustained load, to keep pending work at a "medium" level, to
+let work build up so we can process it in more efficient batches, while also
+giving headroom for bursts in load.
+
+But for desktops or mobile - scenarios where work is less sustained and power
+usage is more important - we want to operate differently, with a "rush to
+idle" so the system can go to sleep. We don't want to be dribbling out
+background work while the system should be idle.
+
+The complicating factor is that there are a number of background tasks, which
+form a hierarchy (or a digraph, depending on how you divide it up) - one
+background task may generate work for another.
+
+Thus proper idle detection needs to model this hierarchy.
+
+- Foreground writes
+- Page cache writeback
+- Copygc, rebalance
+- Journal reclaim
+
+When we implement idle detection and rush to idle, we need to be careful not
+to disturb too much the existing behaviour that works reasonably well when the
+system is under sustained load (or perhaps improve it in the case of
+rebalance, which currently does not actively attempt to let work batch up).
+
+SUSTAINED LOAD REGIME
+---------------------
+
+When the system is under continuous load, we want these jobs to run
+continuously - this is perhaps best modelled with a P/D controller, where
+they'll be trying to keep a target value (e.g. fragmented disk space,
+available journal space) roughly in the middle of some range.
+
+The goal under sustained load is to balance our ability to handle load spikes
+without running out of x resource (free disk space, free space in the
+journal), while also letting some work accumulate to be batched (or become
+unnecessary).
+
+For example, we don't want to run copygc too aggressively, because then it
+will be evacuating buckets that would have become empty (been overwritten or
+deleted) anyways, and we don't want to wait until we're almost out of free
+space because then the system will behave unpredictably - suddenly we're doing
+a lot more work to service each write and the system becomes much slower.
+
+IDLE REGIME
+-----------
+
+When the system becomes idle, we should start flushing our pending work
+quicker so the system can go to sleep.
+
+Note that the definition of "idle" depends on where in the hierarchy a task
+is - a task should start flushing work more quickly when the task above it has
+stopped generating new work.
+
+e.g. rebalance should start flushing more quickly when page cache writeback is
+idle, and journal reclaim should only start flushing more quickly when both
+copygc and rebalance are idle.
+
+It's important to let work accumulate when more work is still incoming and we
+still have room, because flushing is always more efficient if we let it batch
+up. New writes may overwrite data before rebalance moves it, and tasks may be
+generating more updates for the btree nodes that journal reclaim needs to flush.
+
+On idle, how much work we do at each interval should be proportional to the
+length of time we have been idle for. If we're idle only for a short duration,
+we shouldn't flush everything right away; the system might wake up and start
+generating new work soon, and flushing immediately might end up doing a lot of
+work that would have been unnecessary if we'd allowed things to batch more.
+ 
+To summarize, we will need:
+- A list of classes for background tasks that generate work, which will
+  include one "foreground" class.
+- Tracking for each class - "Am I doing work, or have I gone to sleep?"
+- And each class should check the class above it when deciding how much work to issue.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
deleted file mode 100644
index 3864d0ae89c1..000000000000
--- a/Documentation/filesystems/bcachefs/index.rst
+++ /dev/null
@@ -1,31 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-======================
-bcachefs Documentation
-======================
-
-Subsystem-specific development process notes
---------------------------------------------
-
-Development notes specific to bcachefs. These are intended to supplement
-:doc:`general kernel development handbook </process/index>`.
-
-.. toctree::
-   :maxdepth: 1
-   :numbered:
-
-   CodingStyle
-   SubmittingPatches
-
-Filesystem implementation
--------------------------
-
-Documentation for filesystem features and their implementation details.
-At this moment, only a few of these are described here.
-
-.. toctree::
-   :maxdepth: 1
-   :numbered:
-
-   casefolding
-   errorcodes
diff --git a/fs/bcachefs/idle.h b/fs/bcachefs/idle.h
new file mode 100644
index 000000000000..eea49d879e3a
--- /dev/null
+++ b/fs/bcachefs/idle.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_IDLE_H
+#define _BCACHEFS_IDLE_H
+
+/*
+ * Idle/background work classes:
+ *
+ * We have a number of background tasks (copygc, rebalance, journal reclaim).
+ *
+ * SUSTAINED LOAD REGIME
+ * ---------------------
+ *
+ * When the system is under continuous load, we want these jobs to run
+ * continuously - this is perhaps best modelled with a P/D controller, where
+ * they'll be trying to keep a target value (e.g. fragmented disk space,
+ * available journal space) roughly in the middle of some range.
+ *
+ * The goal under sustained load is to balance our ability to handle load spikes
+ * without running out of x resource (free disk space, free space in the
+ * journal), while also letting some work accumulate to be batched (or become
+ * unnecessary).
+ *
+ * For example, we don't want to run copygc too aggressively, because then it
+ * will be evacuating buckets that would have become empty (been overwritten or
+ * deleted) anyways, and we don't want to wait until we're almost out of free
+ * space because then the system will behave unpredictably - suddenly we're doing
+ * a lot more work to service each write and the system becomes much slower.
+ *
+ * IDLE REGIME
+ * -----------
+ *
+ * Many systems are however not under sustained load - they're idle most of the
+ * time, and the goal is to let them idle as much as possible because power
+ * usage is a prime consideration. Thus, we need to detect when we've been
+ * idle - and the longer we've been idle, the more pending work we should do;
+ * the goal being to complete all of our pending work as quickly as possible so
+ * that the system can go back to sleep.
+ *
+ * But this does not mean that we should do _all_ our pending work immediately
+ * when the system is idle; remember that if we allow work to build up, much
+ * work will not need to be done.
+ *
+ * Therefore when we're idle we want to wake up and do some amount of pending
+ * work in batches; increasing both the amount of work we do and the duration of
+ * our sleeps proportional to how long we've been idle for.
+ *
+ * CLASSES OF IDLE WORK
+ * --------------------
+ *
+ * There are levels of foreground and background tasks; a foreground operation
+ * (generated from outside the system, i.e. userspace) will generate work for
+ * the data move class and the journal reclaim class, and the data move class
+ * will generate more work for the journal reclaim class.
+ *
+ * This complicates idle detection, because a given class wants to know if
+ * everything above it has finished or is no longer running, and will want to
+ * behave differently for work above it coming from outside the system (which we
+ * cannot schedule and can only guess at based on past behaviour), versus work
+ * above it but from inside the system (which we can schedule).
+ *
+ * That is
+ * - data moves want to wake up when foreground operations have been quiet for
+ *   a little while
+ * - journal reclaim wants to wake up when foreground operations have been quiet
+ *   for a little while, and immediately after background data moves have
+ *   finished and gone back to sleep
+ */
+
+#define BCACHEFS_IDLE_CLASSES()                \
+       x(foreground)                   \
+       x(data_move)                    \
+       x(journal_reclaim)
+
+enum bch_idle_class {
+#define x(n)   BCH_IDLE_##n,
+       BCACHEFS_IDLE_CLASSES()
+#undef x
+};
+
+#endif /* _BCACHEFS_IDLE_H */
-- 
2.49.0
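
Not part of the patch: a minimal sketch of how the idle rules in the doc
could fit together, assuming each class tracks whether it is running and
when it last did work. Only BCACHEFS_IDLE_CLASSES() and enum bch_idle_class
come from idle.h above; BCH_IDLE_NR, struct idle_class_state, IDLE_RAMP and
both functions are hypothetical names made up for illustration, and the
linear ramp is just one possible reading of "proportional to how long we've
been idle".

```c
#include <stdbool.h>
#include <stdint.h>

#define BCACHEFS_IDLE_CLASSES()		\
	x(foreground)			\
	x(data_move)			\
	x(journal_reclaim)

enum bch_idle_class {
#define x(n)	BCH_IDLE_##n,
	BCACHEFS_IDLE_CLASSES()
#undef x
	BCH_IDLE_NR,	/* hypothetical addition: number of classes */
};

/* Hypothetical per-class tracking: "am I doing work, or have I gone to sleep?" */
struct idle_class_state {
	bool		running;
	uint64_t	last_active;	/* time this class last did work */
};

/* Assumed tunable: idle time (same units as "now") after which we flush everything */
#define IDLE_RAMP	1000

/*
 * How long have the classes above @c been idle? Returns 0 if any of them
 * is still generating work. Assumes now >= last_active.
 */
static uint64_t idle_time_above(const struct idle_class_state *s,
				enum bch_idle_class c, uint64_t now)
{
	/* Foreground work comes from outside; it is not ours to schedule */
	if (c == BCH_IDLE_foreground)
		return 0;

	/* Internal classes above us: we schedule them, so "running" is exact */
	for (int i = BCH_IDLE_foreground + 1; i < (int) c; i++)
		if (s[i].running)
			return 0;

	/* Foreground is external: all we can measure is how long it has been quiet */
	if (s[BCH_IDLE_foreground].running)
		return 0;
	return now - s[BCH_IDLE_foreground].last_active;
}

/* Share of @pending to issue now: ramps linearly from 0 to all of it */
static uint64_t idle_work_budget(const struct idle_class_state *s,
				 enum bch_idle_class c,
				 uint64_t pending, uint64_t now)
{
	uint64_t idle = idle_time_above(s, c, now);

	if (idle >= IDLE_RAMP)
		return pending;
	return pending * idle / IDLE_RAMP;	/* sketch: ignores overflow for huge @pending */
}
```

With foreground quiet for a quarter of IDLE_RAMP, both data_move and
journal_reclaim would be allowed a quarter of their backlog; if data_move is
still running, journal_reclaim gets nothing until it goes back to sleep. The
sustained load regime is deliberately left out of this sketch.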
