Hi,

On 23/11/2015 18:37, Jose Tavares wrote:
> Yes, but with SW-RAID, when a block is read and does not match its
> checksum, the device falls out of the array

I don't think so. Under normal circumstances a device only falls out of
an md array if it doesn't answer I/O requests before a timeout (md
arrays only read from the smallest subset of devices needed to get the
data; they don't verify redundancy on the fly, for performance reasons).
This may not be the case when you explicitly ask an array to perform a
check, though (I don't have a first-hand check failure coming to mind).
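
If you want to trigger such a check by hand, here is a minimal sketch
using the standard md sysfs interface (Python; /dev/md0 and root
privileges are assumptions, adjust to your setup):

    from pathlib import Path

    md = Path("/sys/block/md0/md")               # assumed array name
    (md / "sync_action").write_text("check\n")   # start a read-and-compare pass

    # ... wait for the check to finish (sync_action returns to "idle"), then:
    print("sync_action :", (md / "sync_action").read_text().strip())
    print("mismatch_cnt:", (md / "mismatch_cnt").read_text().strip())

A non-zero mismatch_cnt only tells you that the copies disagree
somewhere, not which copy is the good one.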

>, and the data is read again from the other devices in the array. The problem
> is that in SW-RAID1 we don't have the bad blocks isolated. The disks can be
> synchronized again as the write operation is not tested. The problem (device
> falling out of the array) will happen again if we try to read any other data
> written over the bad block.

With consumer-level SATA drives bad blocks are handled internally
nowadays: the drives remap bad sectors to a reserve area by trying to
copy their content there (this might fail, and md might not have the
opportunity to correct the error: it doesn't use checksums, so it can't
tell which drive has unaltered data, only which one doesn't answer I/O
requests).
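
If you want to see how much remapping a drive has already done, a rough
sketch reading the relevant SMART attributes (smartmontools and the
/dev/sda device name are assumptions here):

    import subprocess

    # Reallocated_Sector_Ct = sectors already remapped to the reserve,
    # Current_Pending_Sector = sectors the drive could not read and is
    # waiting to remap on the next write.
    out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line or "Current_Pending_Sector" in line:
            print(line)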

> My new question regarding Ceph is whether it isolates these bad sectors where
> it found bad data when scrubbing, or will there always be a replica of
> something over a known bad block?

Ceph OSDs don't know about bad sectors; they delegate I/O to the
filesystems beneath them. Some filesystems can recover corrupted data
from one drive by themselves (ZFS or BTRFS when using redundancy at
their level). Without redundancy at their level, these same filesystems
will refuse to give the Ceph OSD the data when they detect corruption;
Ceph notices this (usually during scrubs), and a manual Ceph repair will
then rewrite data over the corrupted copy (at that point, if the
underlying drive has detected a bad sector, it will not reuse it).
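
The manual repair step mentioned above boils down to something like the
following sketch (the ceph CLI and an admin keyring are assumed, and the
parsing of the "ceph health detail" output is approximate):

    import re
    import subprocess

    # Find PGs flagged inconsistent after a scrub...
    detail = subprocess.run(["ceph", "health", "detail"],
                            capture_output=True, text=True).stdout
    for pgid in re.findall(r"pg (\S+) is .*inconsistent", detail):
        print("repairing", pgid)
        # ... and ask Ceph to repair each one (rewriting over the
        # corrupted data).
        subprocess.run(["ceph", "pg", "repair", pgid], check=True)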

> I also saw that Ceph uses some metrics when capturing data from disks. When a
> disk is resetting or has problems, its metrics are going to be bad and the
> cluster will rank this OSD badly. But I didn't see any way of sending alerts
> or anything like that. SW-RAID has its mdadm monitor that alerts when things
> go bad. Do I have to be looking at the Ceph logs all the time to see when
> things go bad?

I'm not aware of any osd "ranking".
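
I don't know of a built-in equivalent of mdadm's monitor mode either;
the usual approach is to have your monitoring poll "ceph health" and
alert when the cluster is not HEALTH_OK. A minimal polling sketch (the
addresses, interval and local MTA are placeholders/assumptions):

    import smtplib
    import subprocess
    import time
    from email.message import EmailMessage

    while True:
        health = subprocess.run(["ceph", "health"],
                                capture_output=True, text=True).stdout.strip()
        if not health.startswith("HEALTH_OK"):
            msg = EmailMessage()
            msg["Subject"] = "ceph cluster is not HEALTH_OK"
            msg["From"] = "ceph-monitor@localhost"    # placeholder
            msg["To"] = "root@localhost"              # placeholder
            msg.set_content(health or "no output from ceph health")
            with smtplib.SMTP("localhost") as s:      # local MTA assumed
                s.send_message(msg)
        time.sleep(300)                               # poll every 5 minutes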

Lionel