On Tue, Feb 27, 2024 at 10:49:46AM -0500, Brian Foster wrote: > On Sat, Feb 24, 2024 at 09:38:04PM -0500, Kent Overstreet wrote: > > Until accounting keys hit the btree, they are deltas, not new versions > > of the existing key; this means we have to teach journal replay to > > accumulate them. > > > > Additionally, the journal doesn't track precisely which entries have > > been flushed to the btree; it only tracks a range of entries that may > > possibly still need to be flushed. > > > > That means we need to compare accounting keys against the version in the > > btree and only flush updates that are newer. > > > > There's another wrinkle with the write buffer: if the write buffer > > starts flushing accounting keys before journal replay has finished > > flushing accounting keys, journal replay will see the version number > > from the new updates and updates from the journal will be lost. > > > > To avoid this, journal replay has to flush accounting keys first, and > > we'll be adding a flag so that write buffer flush knows to hold > > accounting keys until then. > > > > Signed-off-by: Kent Overstreet <[email protected]> > > --- > > fs/bcachefs/btree_journal_iter.c | 23 +++------- > > fs/bcachefs/btree_journal_iter.h | 15 +++++++ > > fs/bcachefs/btree_trans_commit.c | 9 +++- > > fs/bcachefs/btree_update.h | 14 +++++- > > fs/bcachefs/recovery.c | 76 +++++++++++++++++++++++++++++++- > > 5 files changed, 117 insertions(+), 20 deletions(-) > > > ... > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c > > index 96e7a1ec7091..6829d80bd181 100644 > > --- a/fs/bcachefs/recovery.c > > +++ b/fs/bcachefs/recovery.c > > @@ -11,6 +11,7 @@ > > #include "btree_io.h" > > #include "buckets.h" > > #include "dirent.h" > > +#include "disk_accounting.h" > > #include "ec.h" > > #include "errcode.h" > > #include "error.h" > > @@ -87,6 +88,56 @@ static void replay_now_at(struct journal *j, u64 seq) > > bch2_journal_pin_put(j, j->replay_journal_seq++); > > } > > > > +static int bch2_journal_replay_accounting_key(struct btree_trans *trans, > > + struct journal_key *k) > > +{ > > + struct journal_keys *keys = &trans->c->journal_keys; > > + > > + struct btree_iter iter; > > + bch2_trans_node_iter_init(trans, &iter, k->btree_id, k->k->k.p, > > + BTREE_MAX_DEPTH, k->level, > > + BTREE_ITER_INTENT); > > + int ret = bch2_btree_iter_traverse(&iter); > > + if (ret) > > + goto out; > > + > > + struct bkey u; > > + struct bkey_s_c old = bch2_btree_path_peek_slot(btree_iter_path(trans, > > &iter), &u); > > + > > + if (bversion_cmp(old.k->version, k->k->k.version) >= 0) { > > + ret = 0; > > + goto out; > > + } > > So I assume this is what correlates back to the need to not flush the > write buffer until replay completes, otherwise we could unintentionally > skip subsequent key updates. Is that the case?
No, this is the "has this delta been applie to the btree key" check - adding that as a comment. Write buffer exclusion comes with a new filesytem bit that gets set once accounting keys have all been replayed, that's in the next patch
