Re: uncorrectable errors after btrfs replace

Chris Murphy Sun, 18 Aug 2013 14:44:11 -0700

On Aug 18, 2013, at 1:12 PM, Stuart Pook <[email protected]> wrote:

>    6  btrfs filesystem resize 580g .


You first shrank a 2TB btrfs file system on a dm-crypt device to 580GB. But 
then you didn't resize the dm device or the partition?

> 9  time btrfs  balance start -musage=1 -dusage=1 . && time  btrfs filesystem 
> resize 580g .
>  10  time  btrfs filesystem resize 590g .


You resized the fs, but not the underlying devices, then ran a balance, then 
resized the fs two more times? This is weird, and it also makes the sequence 
difficult to follow.

>   13  time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups
>   14  time btrfs replace start  /dev/dm-11 /dev/dm-12 -B /disks/backups

Why is this command repeated? What's with the numbering system that skips 
numbers?

> 
> 
> [...]
> Aug 18 12:28:03 kooka kernel: [54125.020262] ata10: hard resetting link
> Aug 18 12:28:03 kooka kernel: [54125.512032] ata10: SATA link up 3.0 Gbps 
> (SStatus 123 SControl 300)
> Aug 18 12:28:03 kooka kernel: [54125.523759] ata10.00: configured for UDMA/133
> Aug 18 12:28:03 kooka kernel: [54125.536380] ata10: EH complete
> Aug 18 12:28:04 kooka kernel: [54125.770176] ata10.00: exception Emask 0x10 
> SAct 0x7fffffff SErr 0x780100 action 0x6
> Aug 18 12:28:04 kooka kernel: [54125.770181] ata10.00: irq_stat 0x08000000
> Aug 18 12:28:04 kooka kernel: [54125.770184] ata10: SError: { UnrecovData 
> 10B8B Dispar BadCRC Handshk }
> [...]
> Aug 18 12:28:17 kooka kernel: [54138.957095] ata10.00: status: { DRDY }
> Aug 18 12:28:17 kooka kernel: [54138.957100] ata10: hard resetting link
> Aug 18 12:28:17 kooka kernel: [54139.448029] ata10: SATA link up 1.5 Gbps 
> (SStatus 113 SControl 310)
> Aug 18 12:28:17 kooka kernel: [54139.449972] ata10.00: configured for UDMA/133
> Aug 18 12:28:17 kooka kernel: [54139.464065] ata10: EH complete

Bad connection, so libata is dropping the link from 3.0 Gbps to 1.5 Gbps.
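
For anyone who wants to spot this in a long syslog without reading it line by 
line, here is a minimal Python sketch. It assumes the wrapped log lines have 
been joined back into single lines, and the function name link_speed_drops is 
my own invention, not part of any tool. It only looks at the "SATA link up" 
messages shown in the quoted log.

```python
import re

# Matches kernel messages such as "ata10: SATA link up 3.0 Gbps (...)".
LINK_UP = re.compile(r"\b(ata\d+): SATA link up (\d+(?:\.\d+)?) Gbps")

def link_speed_drops(lines):
    """Return (port, old_gbps, new_gbps) for every negotiated-speed decrease."""
    last_speed = {}   # most recent link speed seen per port
    drops = []
    for line in lines:
        match = LINK_UP.search(line)
        if not match:
            continue
        port, speed = match.group(1), float(match.group(2))
        if port in last_speed and speed < last_speed[port]:
            drops.append((port, last_speed[port], speed))
        last_speed[port] = speed
    return drops

# Two lines from the quoted log, with the mail wrapping undone.
sample = [
    "Aug 18 12:28:03 kooka kernel: [54125.512032] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "Aug 18 12:28:17 kooka kernel: [54139.448029] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
]
print(link_speed_drops(sample))  # [('ata10', 3.0, 1.5)]
```

Any entry in the result means the link was renegotiated at a lower speed, 
which is worth checking before starting a long-running replace.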

> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      
>  -       12080

This confirms that both ends of the cable are sensing communication problems 
between drive and controller. The cable needs to be replaced; most likely the 
fault is in a connector rather than in the cable itself.
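
The raw value of attribute 199 is a cumulative counter, so what matters after 
swapping the cable is whether it keeps climbing. A rough Python sketch for 
pulling the raw value out of one row of "smartctl -A" output follows. It 
assumes the usual ten-column attribute table and that the wrapped row has been 
joined into one line; smart_raw_value is a name I made up for illustration.

```python
def smart_raw_value(attribute_line):
    """Return (attribute_id, name, raw_value) from one smartctl -A table row."""
    fields = attribute_line.split()
    # Assumed column order:
    # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    if len(fields) < 10:
        raise ValueError("not a complete attribute row")
    return int(fields[0]), fields[1], int(fields[9])

# The row quoted above, with the mail wrapping undone.
row = "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12080"
print(smart_raw_value(row))  # (199, 'UDMA_CRC_Error_Count', 12080)
```

Record the value before and after replacing the cable; if it stops increasing 
under load, the link problem is probably fixed.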


> I guess that /disks/backup is mostly dead and that I should just reformat it. 
>  What do you think?

Well, I think I'd try to simplify this drastically and see if you've got a 
reproducible bug. I find the steps you listed mostly incoherent, so I can't 
try to do what you did to see if it's reproducible.

> Next time I'll watch /var/log/syslog but I would have preferred that "btrfs 
> replace" stop when getting errors.

The errors should be self-correcting, but the mere fact they're happening means 
that other errors could be occurring undetected. If the data is corrupted in 
transit, but the drive or controller didn't report a problem, then btrfs has 
no way of knowing it was written incorrectly. There's only so much software 
can do to overcome blatant hardware problems.
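
To illustrate why a silently corrupted write only shows up later, here is a 
toy Python model. It is not btrfs code: zlib.crc32 stands in for the crc32c 
checksum btrfs uses, and write_block and read_block are hypothetical names. 
The point is only that the checksum is computed before the data leaves memory, 
so the write appears to succeed and the mismatch is found on a later read.

```python
import zlib

def write_block(data, corrupt_in_transit=False):
    """Simulate a write: checksum in memory first, then send data to 'disk'."""
    checksum = zlib.crc32(data)      # computed before the data leaves RAM
    stored = bytearray(data)
    if corrupt_in_transit:
        stored[0] ^= 0x01            # one flipped bit that no hardware reported
    return bytes(stored), checksum   # the write "succeeds" either way

def read_block(stored, checksum):
    """Simulate a read: return True only if the stored data matches its checksum."""
    return zlib.crc32(stored) == checksum

stored, csum = write_block(b"backup data", corrupt_in_transit=True)
print(read_block(stored, csum))  # False: detected on read, not at write time
```

With a second good copy the bad block could be rewritten from it; without one, 
the error can be detected but not corrected.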

But it seems unlikely that such a high percentage of errors would go 
undetected and result in so many uncorrectable errors, so there may be user 
error here along with a bug.


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
