On 23/11/2015 21:01, Jose Tavares wrote:
>
>
>
>     > My new question regarding Ceph is whether it isolates the bad
>     > sectors where it found bad data when scrubbing, or whether there
>     > will always be a replica of something sitting over a known bad
>     > block?
>     Ceph OSDs don't know about bad sectors; they delegate IO to the
>     filesystems beneath them. Some filesystems can recover corrupted
>     data from one drive (ZFS or BTRFS when using redundancy at their
>     level), and the same filesystems will refuse to give the Ceph OSD
>     data when they detect corruption on a non-redundant setup. Ceph
>     detects this (usually during scrubs) and a manual Ceph repair will
>     then rewrite data over the corrupted data (at that point, if the
>     underlying drive detected a bad sector it will not reuse it).
>
>
> Just forget about the hardware bad block remap list. It got filled
> as soon as we started to use the drive .. :)
>
>
>     Then you can move this drive to the trash pile/ask for a
>     replacement. It is basically unusable.
>
>
> Why?
> 1 (or more) out of 8 drives I see has the remap list full ...
> If you isolate the rest using software you can continue to use the
> drive .. There are no performance issues, etc ..
>

Ceph currently uses filesystems to store its data. As no supported
filesystem or software layer handles bad blocks dynamically, you *will*
have OSD filesystems remounted read-only and OSD failures as soon as a
single sector misbehaves (and if the drive has already emptied its
reserve, you are almost guaranteed to get new defective sectors later,
see below). If your bad drives are spread over the whole cluster, the
chance of simultaneous failures and of degraded or inactive pgs (which
freeze any IO to them) is far higher. You will then have to manually
bring these OSDs back online to recover (unfreeze IO). If you don't
succeed because the drives failed to the point that you can't recover
the OSD content, you will simply lose data.
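
For reference, here is a rough sketch of the kind of manual
intervention this implies. The commands below are examples only: they
assume a systemd-based install, the default OSD data path and made-up
OSD/pg ids, so adapt them to your setup.

    # See which pgs are stuck and which OSDs are down
    ceph health detail
    ceph pg dump_stuck inactive
    ceph osd tree | grep -w down

    # An OSD whose filesystem went read-only after an IO error has to
    # be repaired and remounted before its daemon can start again
    # (osd id 12 is an example)
    mount | grep /var/lib/ceph/osd/ceph-12
    systemctl start ceph-osd@12      # or: service ceph start osd.12

    # Once the OSD is back, scrub inconsistencies can be repaired
    # per pg (the pg id is an example)
    ceph pg repair 2.1f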

From what I can read here, the main filesystems for Ceph are XFS, Btrfs
and ext4, with some people using ZFS. Of those 4, only ext4 has support
for manually setting bad blocks on an unmounted filesystem. If you
don't have the precise offset of each bad sector, you'll have to scan
the whole device (e2fsck -c) before e2fsck can *try* to put your
filesystem back in shape after any bad block is detected. You will then
have to be very careful to remove any file using a bad block before
restarting the OSD, to avoid serving corrupted data (hopefully e2fsck
will move such files to lost+found for you). You might not be able to
restart the OSD at all, depending on which files are missing.
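
To make that concrete, here is a sketch of that ext4-only procedure.
The device, block and inode numbers below are pure examples, and the
scan reads the entire device, so expect it to take hours on a large
OSD.

    # On the unmounted OSD filesystem: read-test every sector and
    # record failing ones in the bad block inode (-k keeps previously
    # recorded bad blocks, -y answers yes to repair prompts)
    e2fsck -fck -y /dev/sdb1

    # List the blocks now marked bad, then map a bad block to its
    # inode and the inode to a file name so the damaged file can be
    # removed before restarting the OSD
    dumpe2fs -b /dev/sdb1
    debugfs -R "icheck 123456" /dev/sdb1
    debugfs -R "ncheck 654321" /dev/sdb1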

Finally, at least XFS and Btrfs don't have any support for bad blocks
AFAIK, so you simply can't use such drives with these 2 filesystems
without the filesystems failing and fsck being unable to help. MD RAID
won't help you either: its bad-block log merely records failing
sectors so the rest of the array can stay up, it doesn't remap them or
repair anything.

The fact that bad-block support is almost non-existent is easy to
understand from history. Only old filesystems, designed when drives
didn't have an internal reserve to remap bad sectors transparently and
bad blocks were a normal occurrence, still keep tabs on bad sectors
(ext4 inherited it from ext2, vfat/fat32 has it too, ...). Today a disk
drive that starts to report bad sectors on reads has emptied its
reserve, so it already has a long history of bad sectors. It isn't
failing one sector, it's in the process of failing thousands of them,
so there's no reason to expect it to behave correctly anymore: the
application layers above (md, lvm, filesystems, ...) simply don't try
to fight a battle that can't be won and would only add complexity and
hurt performance on a healthy drive.
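
As a side note, you don't have to wait for read errors to know a drive
is in that state: smartmontools will show whether the reserve is
already being consumed (attribute names vary a bit between vendors,
and /dev/sdb is an example device):

    # Non-zero and growing reallocated/pending/uncorrectable counts
    # mean the drive is already eating into (or has exhausted) its
    # reserve of spare sectors
    smartctl -A /dev/sdb | egrep -i 'realloc|pending|uncorrect'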

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
