Re: [PATCH] bcachefs: Allocator now directly wakes up copygc when necessary

Ahmad Draidi Wed, 23 Oct 2024 20:53:41 -0700

Greetings,


On 10/20/24 01:56, Kent Overstreet wrote:

> copygc tries to wait in a way that balances waiting for work to
> accumulate with running before we run out of free space - but for a
> variety of reasons (multiple devices, io clock slop, the vagaries of
> fragmentation) this isn't completely reliable.
>
> So to avoid getting stuck, add direct wakeups from the allocator to the
> copygc thread when we start to notice we're low on free buckets.

Since I switched from 6.10.x to 6.11.x, I've been getting "Allocator stuck? Waited for 30 seconds" messages, and I/O to the FS stops. There is no timeout on read, for example; it just stops for hours, until I reboot. I'm able to quickly and reliably trigger this with my workload.

I applied this patch on top of 6.11.4 but can still see "Allocator stuck" in dmesg. I see the following both before and after the patch:


"BUG: unable to handle page fault for address: fffffffffffff81b
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page"

...

"RIP: 0010:bch2_btree_path_peek_slot+0x64/0x210 [bcachefs]"

A longer log snippet of "allocator stuck" and the above is at: https://pastebin.com/ptuzaryi

I ran fsck after the FS got stuck; errors were found and fixed, but the issue still happens again, both before and after the patch.

Some info that might be needed: I'm using ECC RAM, 2x SAS SSDs, 2x SATA HDDs, LUKS, and the following opts:

starting version 1.12: rebalance_work_acct_fix opts=metadata_replicas=2,data_replicas=2,metadata_replicas_required=2,data_replicas_required=2,
metadata_checksum=xxhash,data_checksum=xxhash,compression=lz4,background_compression=gzip,metadata_target=ssd,foreground_target=ssd,
background_target=hdd,promote_target=ssd


Let me know if I can help.


Thanks!

Ahmad
