On Mon, Mar 29, 2021 at 4:22 AM Bas Hulsken <bas.huls...@gmail.com> wrote:
>
> Dear list,
>
> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt). When I run a scrub, the
> bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try that
> again (this has happened 3 times now, and was possibly the root cause
> of the transid verify failed errors; at least, they did not show up
> before the failed scrubs).

Is the dmesg filtered? An unfiltered dmesg would help show what's going
on with the drive becoming unresponsive: whether the drive itself is
spitting out errors, or whether there are kernel link reset messages.

Check if the drive supports SCT ERC.

smartctl -l scterc /dev/sdX

If it's supported but not enabled, enable it. Do this for all the drives:

smartctl -l scterc,70,70 /dev/sdX
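
For example, a minimal sketch that applies this to all four array members
(assuming they show up as sda through sdd; adjust to your actual device
names):

for dev in /dev/sd{a,b,c,d}; do
    # 70 is in units of 100 ms, i.e. give up after 7 seconds
    smartctl -l scterc,70,70 "$dev"
done

Note the setting is volatile on many drives and resets on power cycle,
so to make it stick you'd put this in a udev rule or a boot-time script.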

The value is in units of 100 ms, so 70 means the drive gives up on a bad
sector after 7 seconds rather than doing the very slow "deep recovery"
on reads. If recovery runs past 30 seconds, the kernel's command timer
will decide the device is unresponsive and issue a link reset, which is
bad for this use case. You really want the drive to error out quickly
and allow Btrfs to do the fixups.

If you can't configure SCT ERC on the drives, you'll instead need to
increase the kernel command timeout, which is a per-device value in
/sys/block/sdX/device/timeout. The default is 30 (seconds), and chances
are 180 is enough; that sounds terribly high, and it is, but reportedly
some consumer drives really can take that long on error recovery.
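
For example (again a sketch assuming the members are sda through sdd;
run as root):

for dev in sd{a,b,c,d}; do
    echo 180 > /sys/block/$dev/device/timeout
done

This also doesn't persist across reboots, so it likewise belongs in a
udev rule or boot-time script.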

Basically you want the drive's error timeout to be shorter than the
kernel's: either 7 seconds of SCT ERC under the default 30-second
command timer, or the drive's uncontrollable worst case under a raised
180-second timer.

> A new disk is on its way to use btrfs replace,
> but I'm not sure whether that will be a wise choice for a filesystem
> with errors. There was never a crash/power failure, so the filesystem
> was unmounted at every reboot, but as said, on 3 occasions (after a
> scrub) that unmount was with one of the four drives unresponsive.

The least risky option is to not change anything. When you do the
replace, make sure you use a recent btrfs-progs, and use 'btrfs replace'
instead of 'btrfs device add/remove':

https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/
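
For reference, the replace itself would look something like this (a
sketch assuming the new disk shows up as /dev/sde and the filesystem is
mounted at /mnt; -r avoids reading from the flaky source device unless
no other good copy exists):

btrfs replace start -r /dev/sdd /dev/sde /mnt
btrfs replace status /mnt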

If metadata is raid5 too, or if the filesystem isn't already using
space_cache v2, I'd leave those alone until after the flaky device is
replaced.
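
You can check both without changing anything (assuming a mount point of
/mnt):

btrfs filesystem df /mnt    # shows the data/metadata/system profiles
findmnt -no OPTIONS /mnt    # space_cache=v2 appears here if enabled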


> Funnily enough, after every reboot the filesystem gets mounted
> without issues (the unresponsive drive is back online), and btrfs check
> --readonly claims the filesystem has no errors (see attached
> btrfs_sdd_check.txt).

I'd take advantage of its cooperative moment by making sure backups are
fresh, in case things get worse.

> Not sure what to do next, so seeking your advice! The important data
> on the drive is backed up, and I'll be running a verify overnight to
> see if there are any corruptions. I would still like to save the
> filesystem if possible, though.



-- 
Chris Murphy
