Below is the status of my disastrous self-inflicted journey. I will preface
this by admitting that it could not have been prevented by software
attempting to keep me from being stupid.

I have a production cluster with over 350 XFS-backed osds running Luminous.
We want to transition the cluster to Bluestore so that we can enable EC for
CephFS. We are currently at 75+% utilization, and EC could really help us
reclaim some much needed capacity. Formatting 1 osd at a time and waiting
on the cluster to backfill after every disk was going to take a very long
time (an estimated 240+ days based on our observations). Formatting an
entire host at once caused a little too much turbulence in the cluster.
Furthermore, we could start the transition to EC before the entire cluster
was migrated, as long as enough hosts had enough disks running Bluestore.
As such, I decided to parallelize. The general idea (sketched below) was
that we could format any osd that had nothing other than active+clean pgs
associated with it. I maintain that this method should work. But something
went terribly wrong with the script, and somehow we formatted disks in a
manner that brought PGs into an incomplete state. It's now pretty obvious
that the affected PGs were backfilling to other osds when the script
clobbered the last remaining good set of objects.
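
The safety check we intended looks roughly like the sketch below, where
osd.${osdid} stands in for whichever osd the script is about to reformat.
The JSON shape of "ceph pg ls-by-osd" differs between releases, so treat
the jq filter as an assumption to verify against your own output.

    # Only touch an osd if every PG mapped to it is currently active+clean.
    if ceph pg ls-by-osd osd.${osdid} --format=json \
         | jq -e 'all(.[]; .state == "active+clean")' > /dev/null; then
        echo "osd.${osdid} has only active+clean PGs"
    fi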

This cluster serves CephFS and a few RBD volumes.

mailing list submissions related to this outage:
cephfs-data-scan pg_files errors
finding and manually recovering objects in bluestore
Determine cephfs paths and rados objects affected by incomplete pg

Our recovery
1) We allowed the cluster to repair itself as much as possible.

2) Following self-healing, we were left with 3 incomplete PGs: 2 in the
CephFS data pool and 1 in an RBD pool.

3) Using ceph pg ${pgid} query, we found all disks known to have recently
contained some of that PG's data.
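
For example, something along these lines pulls the candidate osds out of
the query output (field names shift a little between releases, so treat the
jq path and grep patterns as assumptions to check against your own output):

    ceph pg ${pgid} query > /tmp/${pgid}.query.json
    # The current up and acting sets sit at the top level of the output.
    jq '.up, .acting' /tmp/${pgid}.query.json
    # Peers that recently held data tend to appear in the peering/recovery
    # sections, e.g. probing_osds and might_have_unfound.
    grep -E 'probing_osds|might_have_unfound' /tmp/${pgid}.query.json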

4) For each osd listed in the pg query, we exported the remaining PG data
using ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid}/
--pgid ${pgid} --op export --file
/media/ceph_recovery/ceph-${osdid}/recover.${pgid}
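
Roughly speaking, that amounts to a small per-host loop along these lines,
where ${osds_with_pg} is a stand-in for the list of local osd ids found in
step 3 (the osd has to be stopped before ceph-objectstore-tool will open
its store):

    for osdid in ${osds_with_pg}; do
        systemctl stop ceph-osd@${osdid}
        mkdir -p /media/ceph_recovery/ceph-${osdid}
        ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid}/ \
            --pgid ${pgid} --op export \
            --file /media/ceph_recovery/ceph-${osdid}/recover.${pgid}
    done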

5) After collecting all of the possible exports, we compared the recovery
files and chose the largest. I would have appreciated the ability to do a
merge of some sort on these exports, but we'll take what we can get. We're
just going to assume the largest export was the most complete backfill at
the time disaster struck.
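
With the export files gathered in one place, picking the largest comes down
to something like:

    # Largest export listed first; paths follow the layout from step 4.
    ls -lS /media/ceph_recovery/ceph-*/recover.${pgid} | head -n 1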

6) We removed the nearly empty pg from the acting osds
using ceph-objectstore-tool --op remove --data-path
/var/lib/ceph/osd/ceph-${osdid} --pgid ${pgid}

7) We imported the largest export we had into the acting osds for the pg.
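
The import is the mirror image of the export in step 4, again run with the
osd stopped; roughly:

    # Sketch only: the export file carries the pg metadata, so --pgid should
    # not be needed for the import op (check --help on your version).
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid}/ \
        --op import --file /media/ceph_recovery/ceph-${osdid}/recover.${pgid}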

8) We marked the pg as complete by running the following on the acting
primary: ceph-objectstore-tool --op mark-complete --data-path
/var/lib/ceph/osd/ceph-${osdid}/ --pgid ${pgid}

9) We were convinced that it would be possible for multiple exports of the
same partially backfilled PG to contain different objects. As such, we
started reversing the format of the export file so that we could extract
the objects from each export and compare them.

10) While our resident reverse engineer was hard at work, focus shifted
toward tooling to identify corrupt files and RBDs and the appropriate
actions for each (a rough sketch of the checks is below).
10a) A list of all rados objects was dumped for our most valuable data
(CephFS). Our first mechanism of detection is a skip in the object sequence
numbers.
10b) Because our metadata pool was unaffected by this mess, we are trusting
that ls delivers correct file sizes even for corrupt files. From the size,
we should be able to work out how many objects make up the file. If the
count of objects for that file's inode is less than that, there's a
problem. More than the calculated amount??? The world definitely explodes.
10c) Finally, the saddest check is whether there are no objects in rados at
all for that inode.
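
As a rough illustration of 10a-10c (not the actual tooling, and assuming the
CephFS data pool is named cephfs_data with the default 4 MiB object size):

    # CephFS data objects are named <inode-hex>.<sequence-hex>, e.g.
    # 10000000000.00000003.  Dump and sort them so objects belonging to the
    # same inode sit together with their sequence numbers in order.
    rados -p cephfs_data ls | sort -t. -k1,1 -k2,2 > /tmp/cephfs_objects.txt

    # 10a: flag inodes whose sequence numbers skip (needs GNU awk for strtonum).
    awk -F. '
      {
        ino = $1; seq = strtonum("0x" $2)
        if (ino == prev_ino && seq != prev_seq + 1)
          print "possible gap: inode " ino " jumps from " prev_seq " to " seq
        prev_ino = ino; prev_seq = seq
      }' /tmp/cephfs_objects.txt

    # 10b: with 4 MiB objects, a file of st_size bytes should have
    # ceil(st_size / 4194304) objects; compare against the count seen above.
    # 10c: an inode with a nonzero size and no objects at all is a total loss.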

That's where we are right now. I'll update this thread as we get closer to
full recovery, restoring from backups and accepting data loss where
necessary.

I will note that we wish there were some documentation on using
ceph-objectstore-tool. We understand that it's for emergencies, but that's
when concise documentation is most important. From what we've found, the
only documentation seems to be --help and the source code.