On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>
> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>
>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I am running write-intensive (well, sort of: one write every 10s)
>>> workloads on cheap flash media, which has proved horribly
>>> unreliable. A 32GB microSDHC card reported bad blocks after 4 days,
>>> while a USB pen drive returns bogus data without any warning at all.
>>>
>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>> Would it simply mark bad blocks as "bad" and continue to be
>>> operational, or will it bail out when some block can no longer be
>>> read or written on one of the two devices?
>>
>> If a block is read and fails its checksum, then the other copy (in
>> RAID-1) is checked and used if it's good. The bad copy is rewritten
>> with the good data.
>>
>> If the block is bad such that writing to it won't fix it, then there
>> are probably two cases: the device returns an I/O error, in which
>> case I suspect (but can't be sure) that the FS will go read-only; or
>> the device silently fails the write and claims success, in which
>> case you're back to the situation above of the block failing its
>> checksum.
>
> In a normally operating drive, when the drive firmware locates a
> physical sector with persistent write failures, that sector is
> dereferenced: the LBA is remapped to a reserve physical sector, and
> the original sector can no longer be reached by LBA. If all of the
> reserve sectors get used up, the next persistent write failure
> results in a write error reported to libata; this will appear in
> dmesg and should be treated as the drive no longer operating
> normally. Such a drive is useful for storage developers, but not for
> production use.
>
>> There's no marking of bad blocks right now, and I don't know of
>> anyone working on the feature, so the FS will probably keep going
>> back to the bad blocks as it makes CoW copies for modification.
>
> This is maybe relevant:
> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>
> "READ and WRITE commands report CHS or LBA of the first failed sector
> but ATA/ATAPI standard specifies that the amount of transferred data
> on error completion is indeterminate, so we cannot assume that
> sectors preceding the failed sector have been transferred and thus
> cannot complete those sectors successfully as SCSI does."
>
> If I understand that correctly, Btrfs really ought to either punt the
> device or make the whole volume read-only. For production use, going
> read-only could very well mean data loss, even while it preserves the
> state of the file system. Eventually I'd rather see the offending
> device ejected from the volume, with the volume remaining
> rw,degraded.
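To make the read-repair path Hugo describes concrete, here is a toy
userspace sketch (my illustration under simplified assumptions, not
btrfs code: the real filesystem stores crc32c checksums in a dedicated
checksum tree and does repair down at the block layer; plain CRC-32
stands in for crc32c here):

/* Toy model of RAID-1 read-repair: read a mirror, verify its
 * checksum, fall back to the other mirror if it is bad, and rewrite
 * the bad copy from the good one. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Plain CRC-32, standing in for the crc32c btrfs actually uses. */
static uint32_t crc32_sw(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

struct mirror {
    uint8_t data[BLOCK_SIZE];
};

/* Return 0 on success; on success, out holds verified data and any
 * bad mirror has been rewritten from the good copy ("self-heal"). */
static int read_block(struct mirror *m, int nmirrors,
                      uint32_t expected_csum, uint8_t *out)
{
    int good = -1;

    for (int i = 0; i < nmirrors; i++) {
        if (crc32_sw(m[i].data, BLOCK_SIZE) == expected_csum) {
            good = i;
            break;
        }
    }
    if (good < 0)
        return -1;              /* every copy is bad: EIO to the caller */

    /* Repair any mirror that fails its checksum. */
    for (int i = 0; i < nmirrors; i++) {
        if (i != good &&
            crc32_sw(m[i].data, BLOCK_SIZE) != expected_csum) {
            memcpy(m[i].data, m[good].data, BLOCK_SIZE);
            fprintf(stderr, "repaired mirror %d from mirror %d\n",
                    i, good);
        }
    }
    memcpy(out, m[good].data, BLOCK_SIZE);
    return 0;
}

int main(void)
{
    struct mirror m[2];
    uint8_t out[BLOCK_SIZE];

    memset(m[0].data, 'A', BLOCK_SIZE);
    memset(m[1].data, 'A', BLOCK_SIZE);
    uint32_t csum = crc32_sw(m[0].data, BLOCK_SIZE);

    m[0].data[17] ^= 0xFF;      /* silent corruption on mirror 0 */

    if (read_block(m, 2, csum, out) == 0)
        printf("read ok, first byte %c\n", out[0]);
    return 0;
}

The point is that as long as any one mirror verifies, the read
succeeds and the bad copy is quietly rewritten; only when every copy
fails its checksum does the caller see an error.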
I would like to see btrfs hold onto the device in a read-only state,
as is done during a device replace operation. New writes would still
maintain the raid level but would go out to the remaining devices, and
the filesystem would only go fully read-only if the minimum number of
writable devices for the profile is no longer met. Once a new device
is added, the replace operation could commence and drop the bad device
when complete.
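Roughly, the policy I have in mind looks like this (a toy sketch under
my own assumptions about device states, not actual btrfs behavior or
code; the device names are invented):

/* Devices that persistently fail writes are demoted to read-only but
 * kept for reads; the filesystem goes read-only only when fewer
 * writable devices remain than the RAID profile needs. */
#include <stdbool.h>
#include <stdio.h>

enum dev_state { DEV_RW, DEV_FAILED_RO };

struct device {
    const char *name;
    enum dev_state state;
};

/* raid1 needs at least two writable devices to place two copies of
 * every new block. */
#define RAID1_MIN_WRITABLE 2

static int count_writable(const struct device *devs, int n)
{
    int c = 0;
    for (int i = 0; i < n; i++)
        if (devs[i].state == DEV_RW)
            c++;
    return c;
}

/* Demote a device after a persistent write error, then report whether
 * the filesystem as a whole can stay read-write. */
static bool demote_and_check(struct device *devs, int n, int bad)
{
    devs[bad].state = DEV_FAILED_RO;
    fprintf(stderr, "%s demoted to read-only\n", devs[bad].name);
    return count_writable(devs, n) >= RAID1_MIN_WRITABLE;
}

int main(void)
{
    struct device devs[] = {
        { "sdb", DEV_RW }, { "sdc", DEV_RW }, { "sdd", DEV_RW },
    };

    /* sdc starts failing writes: two writable devices remain, so new
     * raid1 writes simply avoid sdc. */
    printf("fs rw after sdc failure: %s\n",
           demote_and_check(devs, 3, 1) ? "yes" : "no");

    /* sdd fails too: only sdb is left writable, below the raid1
     * minimum, so the filesystem must flip to read-only. */
    printf("fs rw after sdd failure: %s\n",
           demote_and_check(devs, 3, 2) ? "yes" : "no");
    return 0;
}

With three raid1 devices, demoting one still leaves two writable, so
writes continue at full redundancy; demoting a second drops below the
raid1 minimum, and only then does the whole filesystem go read-only.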