Hi all,

For some weeks now we have had a small problem with one of the PGs on our Ceph
cluster.
Every time PG 2.10d is deep scrubbed, it fails with the following:
2018-08-06 19:36:28.080707 osd.14 osd.14 *.*.*.110:6809/3935 133 : cluster 
[ERR] 2.10d scrub stat mismatch, got 397/398 objects, 0/0 clones, 397/398 
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
2609281919/2609293215 bytes, 0/0 hit_set_archive bytes.
2018-08-06 19:36:28.080905 osd.14 osd.14 *.*.*.110:6809/3935 134 : cluster 
[ERR] 2.10d scrub 1 errors
As far as I understand, Ceph is missing an object on osd.14 that should be
stored there. A simple ceph pg repair 2.10d fixes the problem, but as soon as a
deep scrub runs on that PG again (manually or automatically), the error is back.
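For reference, the cycle I keep going through looks roughly like this:

  ceph pg repair 2.10d        # inconsistency is cleared, health goes back to OK
  ceph pg deep-scrub 2.10d    # manually trigger the next deep scrub
  ceph health detail          # ...and the scrub error is back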
I tried to find out which object is missing, but after some searching it seems
there is no straightforward way to list which objects are stored in this PG, or
to see which object exactly is missing.
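From what I could find, the closest options seem to be mapping every object in
the pool to its PG one by one, or listing the PG offline with
ceph-objectstore-tool while the OSD is stopped (pool name and OSD path below
are just placeholders/defaults):

  # map every object in the pool to its PG and keep the ones in 2.10d
  rados -p <pool> ls | while IFS= read -r obj; do
      ceph osd map <pool> "$obj" | grep -q '(2\.10d)' && printf '%s\n' "$obj"
  done

  # or, with the OSD stopped, list the PG contents directly from the store
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 \
      --pgid 2.10d --op list

Neither of those seemed very practical here.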
That's why I've gone for some "unconventional" methods.
I completely removed osd.14 from the cluster, waited until everything had
rebalanced, and then added the OSD again.
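Roughly what I did, in case it matters:

  ceph osd out 14                            # let the data drain off the OSD
  # waited for the rebalance to finish (ceph -s), then:
  systemctl stop ceph-osd@14
  ceph osd purge 14 --yes-i-really-mean-it   # remove it from CRUSH, auth and the OSD map
  # afterwards the OSD was re-created and re-added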
Unfortunately the problem is still there.

Some weeks later we added a large number of OSDs to our cluster, which had a
big impact on the CRUSH map.
Since then PG 2.10d has been running on two other OSDs -> [119,93] (we have a
replica count of 2).
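The mapping and the replica count above come from the usual commands (pool name
replaced by a placeholder):

  ceph pg map 2.10d                # shows the up/acting set, currently [119,93]
  ceph osd pool get <pool> size    # size: 2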
Still the same error message, just from a different OSD:
2018-10-03 03:39:22.776521 7f12d9979700 -1 log_channel(cluster) log [ERR] : 
2.10d scrub stat mismatch, got 728/729 objects, 0/0 clones, 728/729 dirty, 0/0 
omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7281369687/7281381269 
bytes, 0/0 hit_set_archive bytes.

As a first step it would be enough for me to find out which object is the
problematic one. Then I could check whether the object is critical, whether any
recovery is required, or whether I can simply drop it (which would cover about
90% of the cases).
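One idea I haven't dared to try yet is to dump the object list of the PG from
both replicas and diff them, with noout set and the OSDs stopped one at a time,
roughly:

  ceph osd set noout
  systemctl stop ceph-osd@119
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-119 \
      --pgid 2.10d --op list > /tmp/pg2.10d.osd119
  systemctl start ceph-osd@119
  # same on the host of osd.93, then copy both lists together and:
  diff /tmp/pg2.10d.osd119 /tmp/pg2.10d.osd93
  ceph osd unset noout

Though since this is a stat mismatch (scrub count vs. recorded PG stats) rather
than an inconsistency between the replicas, that diff may well come back empty.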
I hope someone is able to help me get rid of this.
It's not really a problem for us; Ceph keeps running without further issues
despite this message.
It's just a bit annoying that every time the error occurs our monitoring
triggers a big alarm because Ceph is in ERROR status. :)

Thanks in advance,
Roman

