On Jan 9, 2014, at 12:13 PM, Kyle Gates <kylega...@hotmail.com> wrote:
> On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>>
>> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>>
>>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>>> Hi,
>>>>
>>>> I am running write-intensive (well, sort of: one write every 10s)
>>>> workloads on cheap flash media which has proved to be horribly unreliable.
>>>> A 32GB microSDHC card reported bad blocks after 4 days, while a USB
>>>> pen drive returns bogus data without any warning at all.
>>>>
>>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>>> Would it simply mark bad blocks as "bad" and continue to be
>>>> operational, or will it bail out when some block can no longer be
>>>> read or written on one of the two devices?
>>>
>>> If a block is read and fails its checksum, then the other copy (in
>>> RAID-1) is checked and used if it's good. The bad copy is rewritten
>>> using the good data.
>>>
>>> If the block is bad such that writing to it won't fix it, then
>>> there are probably two cases: the device returns an I/O error, in which
>>> case I suspect (but can't be sure) that the FS will go read-only. Or
>>> the device silently fails the write and claims success, in which case
>>> you're back to the situation above of the block failing its checksum.
>>
>> In a normally operating drive, when the drive firmware locates a physical
>> sector with persistent write failures, it's dereferenced: the LBA is
>> remapped to a reserve physical sector, and the original sector can no
>> longer be accessed by LBA. If all of the reserve sectors get used up, the
>> next persistent write failure results in a write error reported to libata,
>> which will appear in dmesg, and the drive should be treated as no longer
>> operating normally. Such a drive is useful for storage developers, but not
>> for production use.
>>
>>> There's no marking of bad blocks right now, and I don't know of
>>> anyone working on the feature, so the FS will probably keep going back
>>> to the bad blocks as it makes CoW copies for modification.
>>
>> This is maybe relevant:
>> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>>
>> "READ and WRITE commands report CHS or LBA of the first failed sector but
>> ATA/ATAPI standard specifies that the amount of transferred data on error
>> completion is indeterminate, so we cannot assume that sectors preceding the
>> failed sector have been transferred and thus cannot complete those sectors
>> successfully as SCSI does."
>>
>> If I understand that correctly, Btrfs really ought to either punt the
>> device or make the whole volume read-only. For production use, going
>> read-only could very well mean data loss, even while preserving the state
>> of the file system. Eventually I'd rather see the offending device ejected
>> from the volume, with the volume remaining rw,degraded.
>
> I would like to see btrfs hold onto the device in a read-only state, like
> is done during a device replace operation. New writes would maintain the
> RAID level but go out to the remaining devices, and the filesystem would
> only go fully read-only if the minimum number of writable devices is not
> met. Once a new device is added, the replace operation could commence and
> drop the bad device when complete.

Sure, keeping a bad device read-only while the volume stays rw is a fine optimization, if that's possible.
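For anyone curious, the read-repair behavior Hugo describes amounts to something like the following. This is a simplified, hypothetical sketch, not the actual btrfs code; btrfs uses crc32c checksums and its real read/repair paths are far more involved.

/*
 * Hypothetical sketch (not btrfs code) of RAID-1 read-repair:
 * read one copy, verify its checksum, fall back to the mirror if
 * the checksum fails, and rewrite the bad copy from good data.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct mirror {
    uint8_t  data[BLOCK_SIZE];
    uint32_t csum;              /* stored checksum for this copy */
};

/* Trivial stand-in for the filesystem's real checksum (crc32c in btrfs). */
static uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

/*
 * Read one logical block that is mirrored across ncopies devices.
 * Returns 0 on success, -1 if every copy fails its checksum.
 */
static int read_mirrored(struct mirror *copies, int ncopies, uint8_t *out)
{
    int good = -1;

    for (int i = 0; i < ncopies; i++) {
        if (checksum(copies[i].data, BLOCK_SIZE) == copies[i].csum) {
            good = i;
            break;
        }
    }
    if (good < 0)
        return -1;              /* no good copy left: report an I/O error */

    memcpy(out, copies[good].data, BLOCK_SIZE);

    /* Repair: rewrite any copy whose checksum did not match. */
    for (int i = 0; i < ncopies; i++) {
        if (i == good)
            continue;
        if (checksum(copies[i].data, BLOCK_SIZE) != copies[i].csum) {
            memcpy(copies[i].data, copies[good].data, BLOCK_SIZE);
            copies[i].csum = copies[good].csum;
            fprintf(stderr, "repaired copy %d from copy %d\n", i, good);
        }
    }
    return 0;
}

int main(void)
{
    struct mirror copies[2];
    uint8_t out[BLOCK_SIZE];

    /* Write identical data to both mirrors. */
    memset(copies[0].data, 0xAB, BLOCK_SIZE);
    copies[0].csum = checksum(copies[0].data, BLOCK_SIZE);
    copies[1] = copies[0];

    /* Simulate silent corruption on the first copy. */
    copies[0].data[100] ^= 0xFF;

    if (read_mirrored(copies, 2, out) == 0)
        printf("read ok, first byte 0x%02x\n", out[0]);
    else
        printf("both copies bad: EIO\n");
    return 0;
}

The point the sketch makes is that a single bad copy is transparent to the reader; it's only when every copy fails its checksum (or the rewrite itself errors out) that anything has to bubble up as an I/O error.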
Chris Murphy