On Sun, Apr 15, 2018 at 6:14 AM, Alexander Zapatka
<alexzapa...@gmail.com> wrote:
> i recently set up a drive pool in single mode on my little media
> server.  about a week later SMART started telling me that the drive
> was having issue and there is one bad sector.  since the array is far
> from full i decided to remove the drive from the pool.  but running
> btrfs device remove /dev/sdc /mnt/pool
> resulted in a deadlock.  everything crashed, and i had to pull the
> plug to reboot.  once up i did a btrfs check of the drive and it
> reported no issues with the file system...  but running the remove
> again results in a dead lock.  i have tried running a scrub and it
> eventually results in a dead lock also.

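What do you get for the following, run against each drive in the pool
(replace /dev/sdX with the actual device):

$ sudo smartctl -l scterc /dev/sdX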

And can you post a complete dmesg somewhere? Chances are this deadlock
is not really a deadlock. The system is hanging because Btrfs keeps
trying to read a bad block, and the drive takes so long to recover
that the kernel does a SATA link reset. Then Btrfs tries to read
again, and you get another hang while the drive decides what to do,
and so on without end. We need the dmesg even if it takes 30 minutes
for the dmesg command to complete. It's probably easiest to do this
remotely over ssh, so that the dmesg output, when it finally appears,
is already on another machine and you don't have to mess around with
writing it to a file and then getting the file off the hanging
machine.
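
For what it's worth, the remote capture could look something like the
sketch below. The host name `mediaserver` and the output file name are
placeholders I've made up, and on some systems reading the kernel log
needs root, so you may have to run `sudo dmesg` on the remote side
instead.

```shell
# Sketch: run dmesg on the hanging machine over ssh and save the output
# locally, so nothing has to be written on the machine that is hanging.
# $1 = remote host (placeholder), $2 = local file to write.
capture_dmesg() {
    ssh "$1" dmesg > "$2"
}

# Example (hypothetical host name):
#   capture_dmesg mediaserver dmesg-mediaserver.txt
```

Start it from the other machine and leave it running; the file fills in
whenever the remote dmesg finally returns.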

And don't hard reset it. 'sudo reboot -f' should be sufficient and
safe, even if not immediate; it might take a couple of minutes to
actually reboot.

What I'm betting is that you've got a mismatch between the kernel's
SCSI command timer (defaults to 30 seconds) and the SCT ERC setting
for the drives. If they're consumer drives, they either don't support
SCT ERC or it's disabled by default; in either case the recovery can
be well in excess of 30 seconds. So what you have to do is flip that
around so the drive gives up before the kernel does: either the
command timer has to be increased, or the drive's SCT ERC value must
be decreased. Hence we need the info requested above.
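
To make the comparison concrete, here is a rough sketch of checking and
adjusting both sides. Assumptions on my part: `sdX` is a placeholder
for the real device, the function names are made up for illustration,
and the 7 second ERC value and 180 second timer are commonly used
values, not anything measured on this system. smartctl takes SCT ERC
values in units of 100 ms, so 70 means 7.0 seconds; the kernel timer in
sysfs is in seconds.

```shell
# erc_ok ERC_DECISECONDS TIMEOUT_SECONDS
# Succeeds only if ERC is enabled (non-zero) and the drive gives up
# before the kernel's command timer expires.
erc_ok() {
    [ "$1" -gt 0 ] && [ "$1" -lt $(( $2 * 10 )) ]
}

# Print the kernel's command timer for a device such as sdX, in seconds.
kernel_timeout() {
    cat "/sys/block/$1/device/timeout"
}

# Option 1: have the drive give up after 7.0 s on reads and writes.
# Only works if the drive supports SCT ERC. Needs root.
set_drive_erc() {
    smartctl -l scterc,70,70 "/dev/$1"
}

# Option 2: raise the kernel's command timer instead, for drives
# without SCT ERC. Needs root.
raise_kernel_timeout() {
    echo 180 > "/sys/block/$1/device/timeout"
}
```

For example, `erc_ok 70 "$(kernel_timeout sdX)"` succeeds with the
default 30 second timer, while a drive with ERC disabled (treated as 0
here) fails the check. As far as I know neither setting survives a
reboot, so whichever one you pick would need to be reapplied at boot.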

Chris Murphy