On Mon, Nov 23, 2015 at 5:26 PM, Lionel Bouton <[email protected]> wrote:
> Le 23/11/2015 19:58, Jose Tavares a écrit :
>> On Mon, Nov 23, 2015 at 4:15 PM, Lionel Bouton <[email protected]> wrote:
>>> Hi,
>>>
>>> Le 23/11/2015 18:37, Jose Tavares a écrit :
>>>> Yes, but with SW-RAID, when we have a block that was read and does not match its checksum, the device falls out of the array,
>>>
>>> I don't think so. Under normal circumstances a device only falls out of an md array if it doesn't answer IO queries after a timeout (md arrays only read from the smallest subset of devices needed to get the data; they don't verify redundancy on the fly, for performance reasons). This may not be the case when you explicitly ask an array to perform a check, though (I don't have any first-hand check failure coming to mind).
>>>
>>>> and the data is read again from the other devices in the array. The problem is that in SW-RAID1 we don't have the bad blocks isolated. The disks can be synchronized again, as the write operation is not tested. The problem (device falling out of the array) will happen again if we try to read any other data written over the bad block.
>>>
>>> With consumer-level SATA drives, bad blocks are handled internally nowadays: the drives remap bad sectors to a reserve by trying to copy their content there (this might fail, and md might not have the opportunity to correct the error: it doesn't use checksums, so it can't tell which drive has unaltered data, only which one doesn't answer IO queries).
>>
>> Hmm, suppose the drive is unable to remap bad blocks internally. When you write data to the drive, it will also write the data checksum in hardware.
>
> One weak data checksum which is not available to the kernel, yes. Filesystems and applications on top of them may use stronger checksums and handle read problems that the drives can't detect themselves.
>
>> When you read the data, it will compare to this checksum that was written previously.
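For reference, the explicit redundancy check Lionel mentions can be triggered through md's standard sysfs interface. A minimal sketch (assuming a single array at /dev/md0; adjust the device name for your system):

```shell
# Ask md to verify redundancy across all members -- it does not do
# this on normal reads, only when explicitly requested:
echo check > /sys/block/md0/md/sync_action

# Watch the check progress:
cat /proc/mdstat

# After the check completes, mismatch_cnt holds the number of sectors
# whose copies disagreed; a non-zero value means the mirrors diverged:
cat /sys/block/md0/md/mismatch_cnt
```

Note that even when mismatch_cnt is non-zero, md still has no checksum to tell which copy is the good one, which is exactly the limitation discussed above.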
>> If it fails, the drive will reset and the SW-RAID will drop the drive. This is how SATA drives work.
>
> If it fails, AFAIK from past experience it doesn't reset by itself: the kernel driver in charge of the device will receive an IO error and will retry the IO several times. One of those later attempts might succeed (errors aren't always repeatable), and eventually, after a timeout, it will try to reset the interface with the drive and the drive itself (the kernel doesn't know where the problem is, only that it didn't get the result it was expecting).
> While this happens, I believe the filesystem/md/lvm/... stack can receive an IO error (the timeout at their level might not be the same as the timeout at the device level). So some errors can be masked from md and some can percolate through. In the latter case, yes, the md array will drop the device.
>
>>>> My new question regarding Ceph is: does it isolate these bad sectors where it found bad data when scrubbing? Or will there always be a replica of something over a known bad block?
>>>
>>> Ceph OSDs don't know about bad sectors; they delegate IO to the filesystems beneath them. Some filesystems can recover from corrupted data on one drive (ZFS or BTRFS when using redundancy at their level), and the same filesystems will refuse to give the Ceph OSD data when they detect corruption on non-redundant filesystems. Ceph detects this (usually during scrubs), and then a manual Ceph repair will rewrite data over the corrupted data (at this point, if the underlying drive detected a bad sector, it will not reuse it).
>>
>> Just forget about the hardware bad block remapped list. It got filled as soon as we started to use the drive .. :)
>
> Then you can move this drive to the trash pile/ask for a replacement. It is basically unusable.

Why? 1 (or more) out of 8 drives I see have the remap list full ... If you isolate the rest using software you can continue to use the drive ..
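The scrub-then-manual-repair flow Lionel describes maps onto a few standard Ceph commands. A sketch (the placement group id 2.1f below is a hypothetical example, not taken from this thread):

```shell
# Deep scrubs read object data and compare replicas; this is where
# the corruption described above usually gets noticed:
ceph pg deep-scrub 2.1f

# Inconsistent placement groups are reported in cluster health:
ceph health detail

# Repair is a manual step: it rewrites the bad copy from a good
# replica (the drive's own remapping handles the bad sector, if any):
ceph pg repair 2.1f
```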
There are no performance issues, etc ..

>>>> I also saw that Ceph uses some metrics when capturing data from disks. When the disk is resetting or has problems, its metrics are going to be bad and the cluster will rank this OSD badly. But I didn't see any way of sending alerts or anything like that. SW-RAID has its mdadm monitor that alerts when things go bad. Should I have to be looking at Ceph logs all the time to see when things go bad?
>>>
>>> I'm not aware of any osd "ranking".
>>>
>>> Lionel
>>
>> Does "weight" mean the same?
>
> There are 2 weights I'm aware of: the crush weight for an OSD and the temporary OSD weight. The first is the basic weight used by crush to choose how to split your data (an OSD with a weight of 2 is expected to get roughly twice the amount of data of an OSD with a weight of 1 on a normal Ceph cluster); the second is used for temporary adjustments when an OSD gets temporarily overused (during cluster-wide rebalancing, typically) and is reset when the OSD rejoins the cluster (marked in).
>
> Neither of these weights has anything to do with the OSD underlying device health ("being bad").
>
> Lionel

I don't know where I read about it .. Maybe when I read about scrubbing ..
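For completeness, the two weights Lionel distinguishes are set with different commands. A sketch (osd.3 and the values 1.0/0.8 are hypothetical examples):

```shell
# Permanent crush weight: how much data crush targets at this OSD,
# conventionally proportional to its capacity in TB:
ceph osd crush reweight osd.3 1.0

# Temporary override weight (between 0.0 and 1.0), used to shift data
# off an overfull OSD; reset when the OSD is marked in again:
ceph osd reweight 3 0.8

# Both weights are visible side by side in the cluster map:
ceph osd tree
```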
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
