Hi all,

A couple of weeks ago I upgraded from Emperor to Firefly.
I'm using CloudStack with Ceph as the storage backend for VMs and templates.

Since the upgrade, Ceph has been in HEALTH_ERR with 500+ PGs inconsistent and 2000+ scrub errors. I'm not sure whether it has to do with Firefly, but the upgrade was the only major change I made.

After the upgrade I noticed that some of my OSDs were near-full. My current Ceph setup has two racks defined, each with a couple of hosts. One rack was purely for archiving/backup purposes and wasn't very active, so I changed the crushmap and moved some hosts from one rack to the other. I noticed no problems during this move, and the cluster rebalanced itself afterwards. The current problems began after the upgrade and the host move.

The logs show messages like:

2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid 1e3d14ac/rbd_data.867c0514e5cb0.00000000000000e3/head//9 digest 693024524 != known digest 2075700712
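For anyone tracing a similar error back to an image: the `rbd_data.<id>` part of the soid is the image's block_name_prefix, which `rbd info` prints, so the log line can be matched to a specific RBD image. A minimal sketch of pulling the prefix out of the line above (the follow-up loop over `rbd ls` is left commented, since it needs a live cluster):

```shell
# Extract the rbd_data prefix from a scrub-error log line; this prefix
# identifies the RBD image ('rbd info <image>' reports it as
# block_name_prefix, e.g. rbd_data.867c0514e5cb0).
logline='2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid 1e3d14ac/rbd_data.867c0514e5cb0.00000000000000e3/head//9 digest 693024524 != known digest 2075700712'

prefix=$(echo "$logline" | sed -n 's#.*/\(rbd_data\.[0-9a-f]*\)\..*#\1#p')
echo "$prefix"   # rbd_data.867c0514e5cb0

# Then match it against the images in the pool, e.g.:
# for img in $(rbd ls <pool>); do
#     rbd info <pool>/"$img" | grep -q "$prefix" && echo "$img"
# done
```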

A manual repair with, for example, "ceph osd repair" doesn't fix the inconsistency. I've investigated the RBD image(s) and can pinpoint it to a specific VM. When I delete this VM (with the inconsistent PGs in it) from Ceph and run a deep-scrub again, the inconsistency is gone (which makes sense, because the RBD image is removed). But when I re-create the VM, I get the same inconsistency errors again. The errors show up in the same Ceph pool, but in a different PG. At first I thought the base template was the faulty image, but even after removing the base VM template and creating a new one, the inconsistencies still occur.
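In case it helps others in the same state: the inconsistent PGs can be listed from "ceph health detail" and then deep-scrubbed/repaired one by one. A rough sketch of that loop, where the health output below is a made-up two-PG sample (on a real cluster you'd capture it with `health_detail=$(ceph health detail)`) and the actual scrub/repair commands are left commented since they act on a live cluster:

```shell
# Hypothetical 'ceph health detail' output with two inconsistent PGs.
health_detail='HEALTH_ERR 2 pgs inconsistent; 4 scrub errors
pg 9.ac is active+clean+inconsistent, acting [0,3]
pg 9.1f is active+clean+inconsistent, acting [2,5]'

# Pull out just the PG ids of the inconsistent PGs.
pgs=$(echo "$health_detail" | awk '$1 == "pg" && /inconsistent/ {print $2}')

for pg in $pgs; do
    echo "would repair $pg"
    # ceph pg deep-scrub "$pg"   # re-scrub to confirm the error
    # ceph pg repair "$pg"       # then attempt a per-PG repair
done
```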

In total I have 8 pools, and the problem exists in at least half of them.

It doesn't look like the OSDs themselves have any problems or that the disks have bad sectors. The inconsistencies are spread over a bunch of different (almost all, actually) OSDs.

The VMs seem to be running fine despite all these inconsistency errors, but I'm still worried, because I doubt this is a false positive.

I'm at a loss at the moment and not sure what my next step would be.
Is there anyone who can shed some light over this issue?


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
