Greetings,
On 10/20/24 01:56, Kent Overstreet wrote:
> copygc tries to wait in a way that balances waiting for work to
> accumulate with running before we run out of free space - but for a
> variety of reasons (multiple devices, io clock slop, the vagaries of
> fragmentation) this isn't completely reliable.
>
> So to avoid getting stuck, add direct wakeups from the allocator to the
> copygc thread when we start to notice we're low on free buckets.
Since switching from 6.10.x to 6.11.x, I've been getting "Allocator stuck?
Waited for 30 seconds" messages, after which I/O to the FS stops. Reads,
for example, don't time out; they just hang for hours until I reboot. I
can trigger this quickly and reliably with my workload.
I applied this patch on top of 6.11.4 but still see "Allocator
stuck" in dmesg. I also see the following, both before and after the patch:
"BUG: unable to handle page fault for address: fffffffffffff81b
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page"
...
"RIP: 0010:bch2_btree_path_peek_slot+0x64/0x210 [bcachefs]"
A longer log snippet covering both the "Allocator stuck" message and the page fault above is at:
https://pastebin.com/ptuzaryi
I ran fsck after the FS got stuck; errors were found and fixed, but the
issue still recurs, both before and after the patch.
Some info that might be needed: I'm using ECC RAM, 2x SAS SSDs, 2x SATA
HDDs, LUKS, and the following opts:
starting version 1.12: rebalance_work_acct_fix
opts=metadata_replicas=2,data_replicas=2,metadata_replicas_required=2,data_replicas_required=2,
metadata_checksum=xxhash,data_checksum=xxhash,compression=lz4,background_compression=gzip,metadata_target=ssd,foreground_target=ssd,
background_target=hdd,promote_target=ssd
Let me know if I can help with testing or provide more information.
Thanks!
Ahmad