On Wed, Feb 28, 2024 at 07:02:39PM -0500, Kent Overstreet wrote:
> On Wed, Feb 28, 2024 at 03:13:04PM -0500, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 03:03:06PM -0500, Kent Overstreet wrote:
> > > On Wed, Feb 28, 2024 at 02:47:26PM -0500, Brian Foster wrote:
> > > > Hi Kent,
> > > >
> > > > Firstly, I confirmed that today's master seems to avoid the splat I sent
> > > > previously (re: your comment about a reverse journal replay patch or
> > > > some such).
> > > >
> > > > I still reproduce the stall issue on this system. After peeling away at
> > > > it, I was eventually able to reproduce without the drop writes
> > > > (dm-flakey) behavior from the test, and with fio using either the libaio
> > > > or sync I/O engine options. The sync I/O mode fortunately provides a
> > > > more useful stack trace:
> > > >
> > > > # cat /proc/177747/stack
> > > > [<0>] bch2_dio_write_flush+0x122/0x160 [bcachefs]
> > > > [<0>] bch2_direct_write+0xb53/0xce0 [bcachefs]
> > > > [<0>] bch2_write_iter+0x142/0xc70 [bcachefs]
> > > > [<0>] vfs_write+0x29b/0x470
> > > > [<0>] ksys_write+0x6f/0xf0
> > > > [<0>] do_syscall_64+0x86/0x170
> > > > [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > > >
> > > > ... which resolves down to the closure_sync() call in
> > > > bch2_dio_write_flush(). The problem seems to go away if I remove the
> > > > preceding journal flush from that function. This seems to rule out
> > > > io_uring/aio and instead suggest that we're getting stuck somehow
> > > > waiting on a journal flush.
> > > >
> > > > Based on that I went back to the first commit before 746a33c96b7a0
> > > > ("bcachefs: better journal pipelining"). With that, I can run hundreds
> > > > of iterations of generic/703 without a problem, so this appears to be a
> > > > regression associated with the journal pipeline improvements. I'm
> > > > currently re-running on the last known good commit with my test tweaks
> > > > backed out (i.e. so back to io_uring and drop writes) just to
> > > > corroborate that it's the same problem, but so far it's running as
> > > > expected...
> > >
> > > So I suppose the journal must be getting stuck, and a journal write
> > > isn't completing - what does sysfs internal/journal_debug say when it
> > > happens?
> > >
> >
> > dirty journal entries: 0/32768
> > seq: 11
> > seq_ondisk: 11
> > last_seq: 12
> > last_seq_ondisk: 11
> > flushed_seq_ondisk: 11
> > watermark: stripe
> > each entry reserved: 346
> > nr flush writes: 10
> > nr noflush writes: 1
> > average write size: 10.9 KiB
> > nr direct reclaim: 0
> > nr background reclaim: 25
> > reclaim kicked: 0
> > reclaim runs in: 0 ms
> > current entry sectors: 512
> > current entry error: ok
> > current entry: closed
> > unwritten entries:
> > last buf closed
> > replay done: 1
> > space:
> > discarded 512:244736
> > clean ondisk 512:244736
> > clean 512:244736
> > total 512:245760
> > dev 0:
> > nr 480
> > bucket size 512
> > available 478:266
> > discard_idx 1
> > dirty_ondisk 1 (seq 11)
> > dirty_idx 1 (seq 11)
> > cur_idx 1 (seq 11)
> >
>
> I might see it - journal_last_unwritten_seq() is now wrong, it's
> assuming that j->seq_ondisk + 1 hasn't been submitted yet.
>
> bch2_journal_flush_seq_async() then sets that journal buf to "must be a
> flush write", but if it's already been submitted - whoops
On further reading, journal_flusH_seq_async() looks correct, but
bch2_journal_noflush_seq() was wrong - the following patch is in
-testing:
>From 3c8f22258ab39992ac8a4b077610a30c733f068d Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Wed, 28 Feb 2024 21:56:57 -0500
Subject: [PATCH] bcachefs: Fix bch2_journal_noflush_seq()
Improved journal pipelining broke journal_noflush_seq(); it implicitly
assumed only the oldest outstanding journal buf could be in flight, but
that's no longer true.
Make this more straightforward by just setting buf->must_flush whenever
we know a journal buf is going to be flush.
Signed-off-by: Kent Overstreet <[email protected]>
diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index ed1724b52204..3dc1534d1d3c 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -715,7 +715,7 @@ int bch2_journal_flush_seq_async(struct journal *j, u64 seq,
return ret;
seq = res.seq;
- buf = j->buf + (seq & JOURNAL_BUF_MASK);
+ buf = journal_seq_to_buf(j, seq);
buf->must_flush = true;
if (!buf->flush_time) {
@@ -733,8 +733,8 @@ int bch2_journal_flush_seq_async(struct journal *j, u64 seq,
}
/*
- * if write was kicked off without a flush, flush the next sequence
- * number instead
+ * if write was kicked off without a flush, or if we promised it
+ * wouldn't be a flush, flush the next sequence number instead
*/
buf = journal_seq_to_buf(j, seq);
if (buf->noflush) {
@@ -812,8 +812,8 @@ bool bch2_journal_noflush_seq(struct journal *j, u64 seq)
unwritten_seq++) {
struct journal_buf *buf = journal_seq_to_buf(j, unwritten_seq);
- /* journal write is already in flight, and was a flush write: */
- if (unwritten_seq == journal_last_unwritten_seq(j) &&
!buf->noflush)
+ /* journal flush already in flight, or flush requseted */
+ if (buf->must_flush)
goto out;
buf->noflush = true;
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 4557b5c1f7e1..77e93ad875d6 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1935,6 +1935,7 @@ static int bch2_journal_write_pick_flush(struct journal
*j, struct journal_buf *
j->nr_noflush_writes++;
} else {
+ w->must_flush = true;
j->last_flush_write = jiffies;
j->nr_flush_writes++;
clear_bit(JOURNAL_NEED_FLUSH_WRITE, &j->flags);