Also useful for the autopsy, though perhaps not for fixing anything, is knowing whether the SCT ERC value for every drive is less than the kernel's SCSI block device command timeout. It's super important that the drive reports an explicit read failure before the kernel considers the read command failed. If the drive is still trying to do a read when the kernel command timer expires, the kernel just resets the whole link and we lose the outcome of the hanging command. Only upon an explicit read error can Btrfs, or md RAID, know which device and physical sector has a problem, and therefore how to reconstruct the block and fix the bad sector by writing known good data over it.
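The comparison itself is easy to get wrong because of units, so here is a rough sketch of it as a shell helper. It assumes the SCT ERC value is given in tenths of a second (which is how smartctl prints it, e.g. 70 for 7.0 seconds) and the kernel timeout in seconds. The function names erc_ok and check_drive are made up for illustration, and the awk pattern assumes smartctl prints a "Read:" line followed by either a number or "Disabled"; adjust it if your version's output differs.

```shell
# erc_ok ERC_DECISECONDS TIMEOUT_SECONDS
# Succeeds only if the drive gives up on a read before the kernel
# command timer expires. A disabled, zero, or non-numeric ERC value
# counts as a failure, because the drive's recovery time is then unbounded.
erc_ok() {
    erc_ds=$1
    timeout_s=$2
    # Reject empty, non-numeric ("Disabled"), or zero ERC values.
    [ -n "$erc_ds" ] && [ "$erc_ds" -gt 0 ] 2>/dev/null || return 1
    # Compare in tenths of a second.
    [ "$erc_ds" -lt $((timeout_s * 10)) ]
}

# check_drive NAME   (hypothetical wrapper; NAME is e.g. sda; needs root)
# Assumption: the first "Read:" line of smartctl's output carries the
# ERC read value in its second field.
check_drive() {
    dev=$1
    erc=$(smartctl -l scterc "/dev/$dev" | awk '/Read:/ {print $2; exit}')
    tmo=$(cat "/sys/block/$dev/device/timeout")
    if erc_ok "$erc" "$tmo"; then
        echo "$dev: drive reports errors before the kernel timer expires"
    else
        echo "$dev: MISMATCH (ERC=$erc, kernel timeout=${tmo}s)"
    fi
}
```

For example, erc_ok 70 30 succeeds (7.0 seconds is under 30 seconds), while erc_ok Disabled 30 and erc_ok 700 30 both fail.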
The two values can be checked with:

    smartctl -l scterc /device/
    cat /sys/block/sda/device/timeout

Only if SCT ERC is enabled with a value below 30 seconds (smartctl prints it in tenths of a second, so 70 means 7.0 seconds), or if the kernel command timer is changed to be well above 30 (like 180, which is absolutely crazy but a separate conversation), can we be sure that there haven't just been resets going on for a while, preventing bad sectors from being fixed up all along and contributing to the problem. This comes up on the linux-raid (mainly md driver) list all the time, and it contributes to lost RAID arrays all the time. Arguably it leads to unnecessary data loss even in the single-device desktop/laptop use case as well.

Chris Murphy