On Jan 9, 2014, at 12:13 PM, Kyle Gates <kylega...@hotmail.com> wrote:

> On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>> 
>> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>> 
>>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>>> Hi,
>>>> 
>>>> I am running write-intensive (well sort of, one write every 10s)
>>>> workloads on cheap flash media which proved to be horribly unreliable.
>>>> A 32GB microSDHC card reported bad blocks after 4 days, while a USB
>>>> pen drive returned bogus data without any warning at all.
>>>> 
>>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>>> Would it simply mark bad blocks as "bad" and continue to be
>>>> operational, or would it bail out when some block can no longer be
>>>> read or written on one of the two devices?
>>> 
>>> If a block is read and fails its checksum, then the other copy (in
>>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>>> use the good data.
>>> 
>>> If the block is bad such that writing to it won't fix it, there are
>>> probably two cases: either the device returns an I/O error, in which
>>> case I suspect (but can't be sure) that the FS will go read-only, or
>>> the device silently fails the write and claims success, in which case
>>> you're back to the situation above of the block failing its checksum.
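
For illustration, the read-repair path described above looks roughly like
this in sketch form (read_mirror, csum_ok and rewrite_block are hypothetical
helpers, not actual btrfs functions):

/* Illustrative sketch of RAID-1 read repair; not actual btrfs code. */
#include <stdint.h>
#include <errno.h>

int read_mirror(int copy, uint64_t lba, void *buf);          /* hypothetical */
int csum_ok(const void *buf);                                 /* hypothetical */
int rewrite_block(int copy, uint64_t lba, const void *buf);   /* hypothetical */

int raid1_read_block(uint64_t lba, void *buf)
{
        if (read_mirror(0, lba, buf) == 0 && csum_ok(buf))
                return 0;                       /* first copy is good */

        if (read_mirror(1, lba, buf) == 0 && csum_ok(buf)) {
                rewrite_block(0, lba, buf);     /* repair the bad first copy */
                return 0;
        }

        return -EIO;                            /* both copies are bad */
}
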
>> 
>> In a normally operating drive, when the drive firmware identifies a
>> physical sector with persistent write failures, that sector is
>> dereferenced: the LBA is remapped to a reserve physical sector, and the
>> original physical sector can no longer be reached by LBA. If all of the
>> reserve sectors get used up, the next persistent write failure results in
>> a write error reported to libata, which will appear in dmesg and should
>> be treated as the drive no longer operating normally. Such a drive is
>> useful for storage developers, but not for production use.
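
As a toy model (purely illustrative, not real firmware code), that remapping
behaviour could be thought of like this: the drive keeps a small reserve
pool, a persistently failing sector gets its LBA redirected into the pool,
and once the pool is exhausted the write error has to surface to the host:

/* Toy model of drive-firmware sector remapping; purely illustrative. */
#include <stdint.h>
#include <errno.h>

#define RESERVE_SECTORS 64                      /* hypothetical pool size */

static uint64_t remapped_lba[RESERVE_SECTORS];  /* LBAs already remapped */
static int reserve_used;

/* Called when a physical sector keeps failing writes. */
int remap_lba(uint64_t lba)
{
        if (reserve_used == RESERVE_SECTORS)
                return -EIO;    /* pool exhausted: error reaches libata/dmesg */

        remapped_lba[reserve_used++] = lba;     /* LBA now resolves to a
                                                   reserve sector; the original
                                                   physical sector is
                                                   unreachable by LBA */
        return 0;
}
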
>> 
>>> There's no marking of bad blocks right now, and I don't know of
>>> anyone working on the feature, so the FS will probably keep going back
>>> to the bad blocks as it makes CoW copies for modification.
>> 
>> This is maybe relevant:
>> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>> 
>> "READ and WRITE commands report CHS or LBA of the first failed sector but 
>> ATA/ATAPI standard specifies that the amount of transferred data on error 
>> completion is indeterminate, so we cannot assume that sectors preceding the 
>> failed sector have been transferred and thus cannot complete those sectors 
>> successfully as SCSI does."
>> 
>> If I understand that correctly, Btrfs really ought to either punt the
>> device or make the whole volume read-only. For production use, going
>> read-only could very well mean data loss, even while preserving the state
>> of the file system. Ultimately I'd rather see the offending device ejected
>> from the volume, and the volume remain rw,degraded.
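
To make the quoted libata caveat concrete (illustrative only, not libata
code): on a SCSI-style error the sectors before the failing one are known
good and only the tail needs retrying, but on ATA the amount actually
transferred is indeterminate, so nothing before the failed sector can be
assumed complete:

/* Illustrative only; not libata code. */
#include <stdint.h>

struct io_request {
        uint64_t start_lba;
        uint32_t nr_sectors;
};

/* How many sectors of the request may be completed after an error at
 * failed_lba.  For ATA the answer has to be zero, because the standard
 * leaves the transferred amount indeterminate. */
uint32_t sectors_safe_to_complete(const struct io_request *rq,
                                  uint64_t failed_lba, int is_ata)
{
        if (is_ata)
                return 0;
        return (uint32_t)(failed_lba - rq->start_lba);
}
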
> 
> I would like to see btrfs hold onto the device in a read-only state, as is
> done during a device replace operation. New writes would maintain the raid
> level but go only to the remaining devices, and the filesystem would go
> fully read-only only if the minimum number of writable devices is not met.
> Once a new device is added, the replace operation could commence and drop
> the bad device when complete.
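
Sketched out, the policy you describe would amount to something like this
(not an existing btrfs feature; the variables are hypothetical and the
minimum writable count would depend on the raid profile, e.g. 2 for raid1):

/* Sketch of the proposed policy; not existing btrfs behaviour. */
#include <stdbool.h>

static int writable_devices;       /* devices still accepting writes */
static int min_writable_devices;   /* minimum required by the raid profile */

bool handle_failing_device(void)
{
        writable_devices--;        /* demote the bad device to read-only */

        if (writable_devices >= min_writable_devices)
                return true;       /* keep the filesystem rw (degraded) */

        return false;              /* otherwise force the whole fs read-only */
}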

Sure, having a bad device go read-only while the volume stays rw would be a
fine optimization, if that's possible.

Chris Murphy
