Hello,

I have a few OSDs in my cluster that are regularly crashing.

In their logs I can see:

osd.7
-1> 2016-10-06 08:09:18.869687 7ffaa037f700 -1 osd.7 pg_epoch: 128840 pg[5.3as0( v 84797'30080 (67219'27080,84797'30080] local-les=128834 n=13146 ec=61149 les/c 128834/127358 128829/128829/128829) [7,109,4,0,62,32]/[7,109,32,0,62,39] r=0 lpr=128829 pi=127357-128828/12 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0])


osd.32
-411> 2016-10-06 13:21:15.166968 7fe45b6cb700 -1 osd.32 pg_epoch: 129181 pg[5.3as2( v 84797'30080 (67219'27080,84797'30080] local-les=129171 n=13146 ec=61149 les/c 129171/127358 129170/129170/129170) [2147483647,2147483647,4,0,62,32]/[2147483647,2147483647,32,0,62,39] r=2 lpr=129170 pi=121260-129169/43 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0])



osd.109
-1> 2016-10-06 13:17:36.748340 7fa53d36c700 -1 osd.109 pg_epoch: 129167 pg[5.3as1( v 84797'30080 (66310'24592,84797'30080] local-les=129163 n=13146 ec=61149 les/c 129163/127358 129162/129162/129162) [2147483647,109,4,0,62,32]/[2147483647,109,32,0,62,39] r=1 lpr=129162 pi=112552-129161/59 rops=5 bft=4(2),32(5) crt=84797'30076 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0])


Of course, having 3 OSDs dying regularly is not good for my health, so I have set noout to avoid heavy recoveries.
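For reference, that is just the standard flag, nothing exotic, to be reverted once this is sorted:

  ceph osd set noout      # keep the crashed OSDs from being marked out
  ceph osd unset noout    # once the PG is healthy again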

Googling this error message gives exactly one hit:
https://github.com/ceph/ceph/pull/6946

where it says: "the shard must be removed so it can be reconstructed".
But with my 3 OSDs failing, I am not certain which of them contains the broken shard (or perhaps all 3 of them?).

I am a bit reluctant to delete on all 3. I have 4+2 erasure coding
(erasure size 6, min_size 4), so finding out which one is bad would be nice.
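My tentative plan, in case it helps the discussion: assuming these OSDs are filestore with the default paths (which I have not double-checked), I was thinking of comparing the on-disk size of that shard on each OSD that holds PG 5.3a, something like:

  # match on the object suffix only, since filestore escapes the rbd_data prefix in filenames
  find /var/lib/ceph/osd/ceph-*/current/5.3as*_head/ -name '*33640a238e1f29.000000000003b165*' -ls

or, with the OSD stopped, locating the object with ceph-objectstore-tool (paths and shard id below are just an example for osd.32):

  systemctl stop ceph-osd@32
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-32 \
      --journal-path /var/lib/ceph/osd/ceph-32/journal \
      --pgid 5.3as2 --op list rbd_data.33640a238e1f29.000000000003b165

and then removing only the shard whose size disagrees with its peers. But I would rather have someone confirm that approach before I delete anything.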

I hope someone has an idea how to proceed.

Kind regards,
Ronny Aasen

