Hi all,
A couple of weeks ago I upgraded from Emperor to Firefly.
I'm using CloudStack with Ceph as the storage backend for VMs and templates.
Since the upgrade, Ceph has been in HEALTH_ERR with 500+ PGs inconsistent
and 2000+ scrub errors. I'm not sure whether it has to do with Firefly,
but the upgrade was the only major change I made.
After the upgrade I noticed that some of my OSDs were near-full. My
current Ceph setup has two racks defined, each with a couple of hosts.
One rack was used purely for archiving/backup purposes and wasn't very
active, so I changed the CRUSH map and moved some hosts from one rack
to the other. I noticed no problems during this move, and the cluster
rebalanced itself after the change. The current problems began after
the upgrade and the host move.
The logs show messages like:
2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid
1e3d14ac/rbd_data.867c0514e5cb0.00000000000000e3/head//9 digest 693024524
!= known digest 2075700712
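In case it helps, this is roughly how I've been correlating these log
entries with PGs and images (the pool and image names below are just
placeholders for my actual ones):

```shell
# List the PGs currently flagged as inconsistent
ceph health detail | grep inconsistent

# Map the object named in the log line back to its PG and acting OSDs
ceph osd map <pool> rbd_data.867c0514e5cb0.00000000000000e3

# Match the rbd_data prefix from the log to a specific RBD image
rbd info <pool>/<image> | grep block_name_prefix
```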
Manual repair with, for example, "ceph osd repair" doesn't fix the
inconsistency. I've investigated the RBD image(s) and can pinpoint the
problem to a specific VM. When I delete this VM (with the inconsistent
PGs in it) from Ceph and run a deep-scrub again, the inconsistency is
gone (which makes sense, because the RBD image is removed). But when I
re-create the VM, I get the same inconsistency errors again. The errors
show up in the same Ceph pool, but in a different PG. At first I thought
the base template was the faulty image, but even after removing the base
VM template and creating a new one, the inconsistencies still occur.
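In case the exact invocations matter, these are the sort of per-PG
commands I've been running after each change, using 9.ac from the log
above as an example:

```shell
# Re-run a deep scrub on one of the affected PGs
ceph pg deep-scrub 9.ac

# Ask the primary OSD to repair that PG
ceph pg repair 9.ac
```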
In total I have 8 pools, and the problem exists in at least half of them.
It doesn't look like the OSDs themselves have any problems or bad HDD
sectors. The inconsistencies are spread over a bunch of different
(almost all, actually) OSDs.
The VMs seem to be running fine though, even with all these
inconsistency errors, but I'm still worried because I doubt this is a
false positive.
I'm at a loss at the moment and not sure what my next step should be.
Can anyone shed some light on this issue?
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com