> [SNIP - bad drives]

Generally, by the time a disk shows bad blocks to the OS, the drive has already been remapping sectors in the background for ages and is really on its last legs. It is a bit unlikely that so many disks die at exactly the same time, though; more likely the problem had been silently getting worse and was only noticed when the OSDs had to restart after the power loss.


If this is _very_ important data, I would recommend you start by taking the bad drives out of operation and cloning each bad drive block by block onto a good one using dd_rescue. It is also a good idea to store an image of the disk so you can try the different rescue methods several times. In the very worst case, send the disk to a professional data recovery company.
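As a rough sketch only (device names and the image path are placeholders, not taken from your setup):

  # clone the failing drive block by block onto a known-good drive of at
  # least the same size -- double check the device names with lsblk first
  dd_rescue /dev/sdX /dev/sdY

  # also keep an image file somewhere safe, so the different rescue
  # methods can be retried against a fresh copy
  dd_rescue /dev/sdX /mnt/spare/bad-drive-sdX.img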

Once that is done, you have two options:
Try to make the OSD run again: xfs_repair plus manually finding corrupt objects (find + md5sum, looking for read errors) and deleting them has helped me in the past. If you manage to get the OSD to run, drain it by setting its crush weight to 0, and eventually remove the disk from the cluster.
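Roughly, and only as a sketch (OSD id, device and paths are placeholders, assuming a FileStore OSD on XFS):

  # repair the filesystem on the (unmounted) cloned partition
  xfs_repair /dev/sdY1

  # with the OSD mounted but the daemon stopped, read every object once;
  # files that throw read errors are the corrupt objects to delete
  find /var/lib/ceph/osd/ceph-<id>/current -type f -exec md5sum {} \; > /dev/null

  # if the OSD then starts, drain it by setting its crush weight to 0
  ceph osd crush reweight osd.<id> 0

  # and once it is empty, remove it from the cluster
  ceph osd out <id>
  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>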
Alternatively, if you can not get the OSD running again:
Use ceph-objectstore-tool to extract the objects and inject them via a clean node and OSD, as described in http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ . Read the man page and the tool's help output; I think the arguments have changed slightly since that blog post.
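Only as a sketch of the export/import (pg id, OSD ids and paths are placeholders; check ceph-objectstore-tool --help for the exact arguments on your release):

  # on the bad drive's node, with the OSD daemon stopped: export a pg
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<bad-id> \
      --journal-path /var/lib/ceph/osd/ceph-<bad-id>/journal \
      --pgid <pgid> --op export --file /tmp/<pgid>.export

  # on a clean node, with that OSD daemon also stopped: import it again
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<good-id> \
      --journal-path /var/lib/ceph/osd/ceph-<good-id>/journal \
      --op import --file /tmp/<pgid>.export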

You may also run into read errors on corrupt objects that stop your export. In that case rm the offending object (see the sketch below) and rerun the export.
Repeat for all bad drives.
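For example (paths and names are placeholders; with FileStore the objects are plain files under the pg's _head directory, possibly nested in hashed DIR_* subdirectories):

  # find the object file that aborts the export and remove it
  find /var/lib/ceph/osd/ceph-<bad-id>/current/<pgid>_head -name '*<object-name>*'
  rm '/var/lib/ceph/osd/ceph-<bad-id>/current/<pgid>_head/<offending-object-file>'

  # then rerun the ceph-objectstore-tool export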

When doing the inject it is important that your cluster is operational and able to accept objects from the draining drive. So either set the minimal replication (crush failure domain) type to OSD, or, even better, add more OSD nodes to get an operational cluster (with missing objects).
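One way to do that, as a sketch (rule and pool names are placeholders; on newer releases the pool setting is crush_rule instead of crush_ruleset):

  # create a replicated rule that only requires distinct OSDs, not hosts
  ceph osd crush rule create-simple replicate-by-osd default osd

  # look up its id and point the affected pool at it while recovering
  ceph osd crush rule dump
  ceph osd pool set <pool> crush_ruleset <rule-id>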


Also, I see in your log that you have os-prober probing all partitions. I tend to remove os-prober on machines that do not dual-boot with another OS.
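On Debian/Ubuntu based nodes (assuming that is what these machines run) that is just:

  apt-get remove os-prober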

Rules of thumb for future ceph clusters:

min_size=2 is there for a reason; it should never be 1 unless data loss is acceptable.

size=3 if you need the cluster to keep operating while a drive or node is in an error state. size=2 gives you more space, but the cluster will block on errors until recovery is done; better to be blocking than to be losing data. (The pool commands for these settings are sketched below.)

Have more nodes than your configured size. If you have size=3 and only 3 nodes and you lose a node, your cluster can not self-heal.

Keep free space on the drives; this is where data is re-replicated to when a node goes down. If you have 4 nodes and want to be able to lose one and still operate, you need enough leftover room on the 3 remaining nodes to cover for the lost one (3 of the 4 nodes must be able to hold everything, so roughly 75% is the absolute ceiling; staying under about 66% leaves working room to self-heal and keep operating). The more nodes you have, the smaller the impact of a node failure and the less spare room is needed.
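For reference, size and min_size are per-pool settings (pool name is a placeholder):

  ceph osd pool set <pool> size 3
  ceph osd pool set <pool> min_size 2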



good luck
Ronny Aasen


