On Mon, Nov 23, 2015 at 5:26 PM, Lionel Bouton <[email protected]> wrote:
> Le 23/11/2015 19:58, Jose Tavares a écrit :
>> On Mon, Nov 23, 2015 at 4:15 PM, Lionel Bouton <[email protected]> wrote:
>>> Hi,
>>>
>>> Le 23/11/2015 18:37, Jose Tavares a écrit :
>>>> Yes, but with SW-RAID, when we have a block that was read and does not match its checksum, the device falls out of the array,
>>>
>>> I don't think so. Under normal circumstances a device only falls out of an md array if it doesn't answer IO queries after a timeout (md arrays only read from the smallest subset of devices needed to get the data; they don't verify redundancy on the fly, for performance reasons). This may not be the case when you explicitly ask an array to perform a check, though (I don't have any first-hand check failure coming to mind).
>>>
>>>> and the data is read again from the other devices in the array. The problem is that in SW-RAID1 we don't have the bad blocks isolated. The disks can be synchronized again, as the write operation is not tested. The problem (device falling out of the array) will happen again if we try to read any other data written over the bad block.
>>>
>>> With consumer-level SATA drives, bad blocks are handled internally nowadays: the drives remap bad sectors to a reserve by trying to copy their content there (this might fail, and md might not have the opportunity to correct the error: it doesn't use checksums, so it can't tell which drive has unaltered data, only which one doesn't answer IO queries).
>>
>> Hmm, suppose the drive is unable to remap bad blocks internally. When you write data to the drive, it will also write the data checksum in hardware.
>
> One weak data checksum which is not available to the kernel, yes. Filesystems and applications on top of them may use stronger checksums and handle read problems that the drives can't detect themselves.
>
>> When you read the data, it will compare to this checksum that was written previously.
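For reference, the explicit redundancy check Lionel mentions can be triggered through md's standard sysfs interface. A minimal sketch (assuming a single array at /dev/md0; adjust the device name for your system):

```shell
# Ask md to verify redundancy across all members -- it does not do
# this on normal reads, only when explicitly requested:
echo check > /sys/block/md0/md/sync_action

# Watch the check progress:
cat /proc/mdstat

# After the check completes, mismatch_cnt holds the number of sectors
# whose copies disagreed; a non-zero value means the mirrors diverged:
cat /sys/block/md0/md/mismatch_cnt
```

Note that even when mismatch_cnt is non-zero, md still has no checksum to tell which copy is the good one, which is exactly the limitation discussed above.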
>> If it fails, the drive will reset and the SW-RAID will drop the drive. This is how SATA drives work.
>
> If it fails, AFAIK from past experience it doesn't reset by itself: the kernel driver in charge of the device will receive an IO error and will retry the IO several times. One of those later attempts might succeed (errors aren't always repeatable), and eventually, after a timeout, it will try to reset the interface with the drive and the drive itself (the kernel doesn't know where the problem is, only that it didn't get the result it was expecting).
> While this happens, I believe the filesystem/md/lvm/... stack can receive an IO error (the timeout at their level might not be the same as the timeout at the device level). So some errors can be masked from md and some can percolate through. In the latter case, yes, the md array will drop the device.
>
>>>> My new question regarding Ceph is: does it isolate these bad sectors where it found bad data when scrubbing? Or will there always be a replica of something over a known bad block?
>>>
>>> Ceph OSDs don't know about bad sectors; they delegate IO to the filesystems beneath them. Some filesystems can recover from corrupted data on one drive (ZFS or BTRFS when using redundancy at their level), and the same filesystems will refuse to give the Ceph OSD data when they detect corruption on non-redundant filesystems. Ceph detects this (usually during scrubs), and then a manual Ceph repair will rewrite data over the corrupted data (at this point, if the underlying drive detected a bad sector, it will not reuse it).
>>
>> Just forget about the hardware bad block remapped list. It got filled as soon as we started to use the drive .. :)
>
> Then you can move this drive to the trash pile/ask for a replacement. It is basically unusable.

Why? 1 (or more) out of 8 drives I see have the remap list full ... If you isolate the rest using software you can continue to use the drive ..
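The scrub-then-manual-repair flow Lionel describes maps onto a few standard Ceph commands. A sketch (the placement group id 2.1f below is a hypothetical example, not taken from this thread):

```shell
# Deep scrubs read object data and compare replicas; this is where
# the corruption described above usually gets noticed:
ceph pg deep-scrub 2.1f

# Inconsistent placement groups are reported in cluster health:
ceph health detail

# Repair is a manual step: it rewrites the bad copy from a good
# replica (the drive's own remapping handles the bad sector, if any):
ceph pg repair 2.1f
```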
There are no performance issues, etc ..

>>>> I also saw that Ceph uses some metrics when capturing data from disks. When the disk is resetting or has problems, its metrics are going to be bad and the cluster will rank this OSD badly. But I didn't see any way of sending alerts or anything like that. SW-RAID has its mdadm monitor that alerts when things go bad. Should I have to be looking at Ceph logs all the time to see when things go bad?
>>>
>>> I'm not aware of any osd "ranking".
>>>
>>> Lionel
>>
>> Does "weight" mean the same?
>
> There are 2 weights I'm aware of: the crush weight for an OSD and the temporary OSD weight. The first is the basic weight used by crush to choose how to split your data (an OSD with a weight of 2 is expected to get roughly twice the amount of data of an OSD with a weight of 1 on a normal Ceph cluster); the second is used for temporary adjustments when an OSD gets temporarily overused (during cluster-wide rebalancing, typically) and is reset when the OSD rejoins the cluster (marked in).
>
> Neither of these weights has anything to do with the OSD underlying device health ("being bad").
>
> Lionel

I don't know where I read about it .. Maybe when I read about scrubbing ..
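For completeness, the two weights Lionel distinguishes are set with different commands. A sketch (osd.3 and the values 1.0/0.8 are hypothetical examples):

```shell
# Permanent crush weight: how much data crush targets at this OSD,
# conventionally proportional to its capacity in TB:
ceph osd crush reweight osd.3 1.0

# Temporary override weight (between 0.0 and 1.0), used to shift data
# off an overfull OSD; reset when the OSD is marked in again:
ceph osd reweight 3 0.8

# Both weights are visible side by side in the cluster map:
ceph osd tree
```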
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
