On Wed, Feb 28, 2024 at 07:02:39PM -0500, Kent Overstreet wrote:
> On Wed, Feb 28, 2024 at 03:13:04PM -0500, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 03:03:06PM -0500, Kent Overstreet wrote:
> > > On Wed, Feb 28, 2024 at 02:47:26PM -0500, Brian Foster wrote:
> > > > Hi Kent,
> > > > 
> > > > Firstly, I confirmed that today's master seems to avoid the splat I sent
> > > > previously (re: your comment about a reverse journal replay patch or
> > > > some such).
> > > > 
> > > > I still reproduce the stall issue on this system. After peeling away at
> > > > it, I was eventually able to reproduce without the drop writes
> > > > (dm-flakey) behavior from the test, and with fio using either the libaio
> > > > or sync I/O engine options. The sync I/O mode fortunately provides a
> > > > more useful stack trace:
> > > > 
> > > > # cat /proc/177747/stack 
> > > > [<0>] bch2_dio_write_flush+0x122/0x160 [bcachefs]
> > > > [<0>] bch2_direct_write+0xb53/0xce0 [bcachefs]
> > > > [<0>] bch2_write_iter+0x142/0xc70 [bcachefs]
> > > > [<0>] vfs_write+0x29b/0x470
> > > > [<0>] ksys_write+0x6f/0xf0
> > > > [<0>] do_syscall_64+0x86/0x170
> > > > [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > > > 
> > > > ... which resolves down to the closure_sync() call in
> > > > bch2_dio_write_flush(). The problem seems to go away if I remove the
> > > > preceding journal flush from that function. This seems to rule out
> > > > io_uring/aio and instead suggest that we're getting stuck somehow
> > > > waiting on a journal flush.
> > > > 
> > > > Based on that I went back to the first commit before 746a33c96b7a0
> > > > ("bcachefs: better journal pipelining"). With that, I can run hundreds
> > > > of iterations of generic/703 without a problem, so this appears to be a
> > > > regression associated with the journal pipeline improvements. I'm
> > > > currently re-running on the last known good commit with my test tweaks
> > > > backed out (i.e. so back to io_uring and drop writes) just to
> > > > corroborate that it's the same problem, but so far it's running as
> > > > expected...
> > > 
> > > So I suppose the journal must be getting stuck, and a journal write
> > > isn't completing - what does sysfs internal/journal_debug say when it 
> > > happens?
> > > 
> > 
> > dirty journal entries:  0/32768
> > seq:                    11
> > seq_ondisk:             11
> > last_seq:               12
> > last_seq_ondisk:        11
> > flushed_seq_ondisk:     11
> > watermark:              stripe
> > each entry reserved:    346
> > nr flush writes:        10
> > nr noflush writes:      1
> > average write size:     10.9 KiB
> > nr direct reclaim:      0
> > nr background reclaim:  25
> > reclaim kicked:         0
> > reclaim runs in:        0 ms
> > current entry sectors:  512
> > current entry error:    ok
> > current entry:          closed
> > unwritten entries:
> > last buf closed
> > replay done:            1
> > space:
> >         discarded       512:244736
> >         clean ondisk    512:244736
> >         clean           512:244736
> >         total           512:245760
> > dev 0:
> >         nr              480
> >         bucket size     512
> >         available       478:266
> >         discard_idx     1
> >         dirty_ondisk    1 (seq 11)
> >         dirty_idx       1 (seq 11)
> >         cur_idx         1 (seq 11)
> > 
> 
> I might see it - journal_last_unwritten_seq() is now wrong, it's
> assuming that j->seq_ondisk + 1 hasn't been submitted yet.
> 
> bch2_journal_flush_seq_async() then sets that journal buf to "must be a
> flush write", but if it's already been submitted - whoops

On further reading, journal_flusH_seq_async() looks correct, but
bch2_journal_noflush_seq() was wrong - the following patch is in
-testing:


>From 3c8f22258ab39992ac8a4b077610a30c733f068d Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Wed, 28 Feb 2024 21:56:57 -0500
Subject: [PATCH] bcachefs: Fix bch2_journal_noflush_seq()

Improved journal pipelining broke journal_noflush_seq(); it implicitly
assumed only the oldest outstanding journal buf could be in flight, but
that's no longer true.

Make this more straightforward by just setting buf->must_flush whenever
we know a journal buf is going to be flush.

Signed-off-by: Kent Overstreet <[email protected]>

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index ed1724b52204..3dc1534d1d3c 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -715,7 +715,7 @@ int bch2_journal_flush_seq_async(struct journal *j, u64 seq,
                        return ret;
 
                seq = res.seq;
-               buf = j->buf + (seq & JOURNAL_BUF_MASK);
+               buf = journal_seq_to_buf(j, seq);
                buf->must_flush = true;
 
                if (!buf->flush_time) {
@@ -733,8 +733,8 @@ int bch2_journal_flush_seq_async(struct journal *j, u64 seq,
        }
 
        /*
-        * if write was kicked off without a flush, flush the next sequence
-        * number instead
+        * if write was kicked off without a flush, or if we promised it
+        * wouldn't be a flush, flush the next sequence number instead
         */
        buf = journal_seq_to_buf(j, seq);
        if (buf->noflush) {
@@ -812,8 +812,8 @@ bool bch2_journal_noflush_seq(struct journal *j, u64 seq)
             unwritten_seq++) {
                struct journal_buf *buf = journal_seq_to_buf(j, unwritten_seq);
 
-               /* journal write is already in flight, and was a flush write: */
-               if (unwritten_seq == journal_last_unwritten_seq(j) && 
!buf->noflush)
+               /* journal flush already in flight, or flush requseted */
+               if (buf->must_flush)
                        goto out;
 
                buf->noflush = true;
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 4557b5c1f7e1..77e93ad875d6 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1935,6 +1935,7 @@ static int bch2_journal_write_pick_flush(struct journal 
*j, struct journal_buf *
 
                j->nr_noflush_writes++;
        } else {
+               w->must_flush = true;
                j->last_flush_write = jiffies;
                j->nr_flush_writes++;
                clear_bit(JOURNAL_NEED_FLUSH_WRITE, &j->flags);

Reply via email to