On Thu, Feb 29, 2024 at 10:55:46AM -0500, Brian Foster wrote:
> So I can still reproduce the issue on master, which includes commit
> 3c8f22258ab ("bcachefs: Fix bch2_journal_noflush_seq()"). Just for
> reference, the related debug info is as follows:
>
> # cat /proc/13918/stack
> [<0>] bch2_dio_write_flush+0x15b/0x190 [bcachefs]
> [<0>] bch2_direct_write+0xb75/0xd30 [bcachefs]
> [<0>] bch2_write_iter+0x4c/0xf10 [bcachefs]
> [<0>] vfs_write+0x29b/0x470
> [<0>] ksys_write+0x6f/0xf0
> [<0>] do_syscall_64+0x86/0x170
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
>
> # cat journal_debug
> dirty journal entries: 0/32768
> seq: 72
> seq_ondisk: 72
> last_seq: 73
> last_seq_ondisk: 72
> flushed_seq_ondisk: 72
> watermark: stripe
> each entry reserved: 361
> nr flush writes: 65
> nr noflush writes: 0
> average write size: 2.14 KiB
> nr direct reclaim: 0
> nr background reclaim: 66
> reclaim kicked: 0
> reclaim runs in: 0 ms
> blocked: 0
> current entry sectors: 512
> current entry error: ok
> current entry: closed
> unwritten entries:
> last buf closed
> replay done: 1
> space:
> discarded 512:244736
> clean ondisk 512:244736
> clean 512:244736
> total 512:245760
> dev 0:
> nr 480
> bucket size 512
> available 478:188
> discard_idx 1
> dirty_ondisk 1 (seq 72)
> dirty_idx 1 (seq 72)
> cur_idx 1 (seq 72)
>
> # cat closures
> 0000000044781eb9: __closure_sync+0x49/0x180 -> closure_sync_fn+0x0/0x30 p
> 0000000000000000 r 1
> W bch2_journal_flush_seq_async.part.0+0xed/0x590 [bcachefs]
>
> 000000001b9a5601: bch2_fs_open+0x538/0x15e0 [bcachefs] -> 0x0 p
> 0000000000000000 r 1 R
So, definitely some sort of lost wakeup, or we accidentally waited on a
journal buf that wasn't actually outstanding