Hi all,
A couple of weeks ago I upgraded from Emperor to Firefly.
I'm using CloudStack with Ceph as the storage backend for VMs and templates.
Since the upgrade, Ceph has been in HEALTH_ERR with 500+ PGs inconsistent
and 2000+ scrub errors. I'm not sure whether it has to do with Firefly,
but the upgrade was the only major change I made.
After the upgrade I noticed that some of my OSDs were near-full. My
current Ceph setup has two racks defined, each with a couple of hosts.
One rack was used purely for archiving/backup purposes and wasn't very
active, so I changed the CRUSH map and moved some hosts from one rack
to the other. I noticed no problems during this move, and the cluster
rebalanced itself after the change. The current problems began after
the upgrade and the host move.
The logs show messages like:
2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid
1e3d14ac/rbd_data.867c0514e5cb0.00000000000000e3/head//9 digest 693024524
!= known digest 2075700712
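In case it helps, this is roughly how I've been correlating these log
entries with PGs and images (the pool and image names below are just
placeholders for my actual ones):

```shell
# List the PGs currently flagged as inconsistent
ceph health detail | grep inconsistent

# Map the object named in the log line back to its PG and acting OSDs
ceph osd map <pool> rbd_data.867c0514e5cb0.00000000000000e3

# Match the rbd_data prefix from the log to a specific RBD image
rbd info <pool>/<image> | grep block_name_prefix
```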
Manual repair with, for example, "ceph osd repair" doesn't fix the
inconsistency. I've investigated the RBD image(s) and can pinpoint the
problem to a specific VM. When I delete this VM (with the inconsistent
PGs in it) from Ceph and run a deep-scrub again, the inconsistency is
gone (which makes sense, because the RBD image is removed). But when I
re-create the VM, I get the same inconsistency errors again. The errors
show up in the same Ceph pool, but in a different PG. At first I thought
the base template was the faulty image, but even after removing the base
VM template and creating a new one, the inconsistencies still occur.
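In case the exact invocations matter, these are the sort of per-PG
commands I've been running after each change, using 9.ac from the log
above as an example:

```shell
# Re-run a deep scrub on one of the affected PGs
ceph pg deep-scrub 9.ac

# Ask the primary OSD to repair that PG
ceph pg repair 9.ac
```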
In total I have 8 pools, and the problem exists in at least half of them.
It doesn't look like the OSDs themselves have any problems or bad HDD
sectors. The inconsistencies are spread over a bunch of different
(almost all, actually) OSDs.
The VMs seem to be running fine though, even with all these
inconsistency errors, but I'm still worried because I doubt this is a
false positive.
I'm at a loss at the moment and not sure what my next step should be.
Can anyone shed some light on this issue?
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com