Success! Hopefully my notes from the process will help:
In the event of multiple disk failures the cluster can lose PGs. Should this
occur, the best first step is to restart the ceph-osd process and have the
drive marked as up+out. Marking the drive out causes its data to flow off to
elsewhere in the cluster. If the ceph-osd process is unable to keep running,
you can instead use the ceph_objectstore_tool program to extract just the
damaged PGs from the dying drive and import them into a working OSD.
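The mark-out step described above looks roughly like this (a sketch against a live cluster; "15" is an example OSD id, not one from a specific incident):

```shell
# Mark the OSD out so existing data drains to the rest of the cluster,
# while the ceph-osd process itself stays up and serving reads.
ceph osd out 15

# Confirm the OSD now shows as up and out, and watch recovery progress.
ceph osd tree
ceph -w
```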
Fixing Journals
In this particular scenario things were complicated by the fact that
ceph_objectstore_tool first shipped in Giant while we were running Firefly.
Since we didn't want to upgrade the cluster while it was in a degraded state,
the OSD drives had to be moved to a different physical machine for repair.
This added a number of journal-related steps, but it wasn't a big deal. That
process looks like:
On Storage1:
stop ceph-osd id=15
ceph-osd -i 15 --flush-journal
ls -l /var/lib/ceph/osd/ceph-15/journal
Note the journal device UUID, then pull the disk and move it to Ithome:
rm /var/lib/ceph/osd/ceph-15/journal
ceph-osd -i 15 --mkjournal
That creates a colocated journal to use during the ceph_objectstore_tool
commands. Once done:
ceph-osd -i 15 --flush-journal
rm /var/lib/ceph/osd/ceph-15/journal
Pull the disk and bring it back to Storage1. Then:
ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f \
    /var/lib/ceph/osd/ceph-15/journal
ceph-osd -i 15 --mkjournal
start ceph-osd id=15
None of this will be needed once the cluster is running Hammer, since a
compatible ceph_objectstore_tool will then be available on the local machine
and the journals can stay in place throughout the process.
Recovery Process
We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and
OSD.15 which were the two disks which failed out of Storage1. The disk for
OSD.0 seemed to be a total loss while the disk for OSD.15 was somewhat more
cooperative but not in a place to be up and running in the cluster. I took the
dying OSD.15 drive and placed it into a new physical machine with a fresh
install of Ceph Giant. Using Giant's ceph_objectstore_tool I was able to
extract the PGs with a command like:
for i in 3.c7 3.102 ; do ceph_objectstore_tool --data-path
/var/lib/ceph/osd/ceph-15 --journal-path /var/lib/ceph/osd/ceph-15/journal
--op export --pgid $i --file ~/${i}.export ; done
Once both PGs were successfully exported I attempted to import them into a new
temporary OSD, following instructions from here. For some reason that didn't
work: the OSD was up+in but wasn't backfilling the PGs into the cluster. If
you find yourself in this situation I would still try that first, in case it
provides a cleaner process.
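The temporary-OSD attempt looked roughly like the following (a sketch, since the linked instructions aren't reproduced here; "20" is a hypothetical id for the scratch OSD):

```shell
# Import the exported PGs into a freshly created, stopped OSD, then start
# it and let it (ideally) backfill the PGs back into the cluster.
stop ceph-osd id=20
for i in 3.c7 3.102 ; do
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-20 \
        --journal-path /var/lib/ceph/osd/ceph-20/journal \
        --op import --file ~/${i}.export
done
start ceph-osd id=20
```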
Since the above didn't work, and we were looking at the possibility of losing
the RBD volume (or perhaps worse, fruitlessly fscking 35TB), I took what I
might describe as heroic measures:
First I listed the incomplete PGs with:
ceph pg dump | grep incomplete
3.c7 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.968841 0'0
15730:17 [15,0] 15 [15,0] 15 13985'54076 2015-03-31 19:14:22.721695
13985'54076 2015-03-31 19:14:22.721695
3.102 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.529594 0'0
15730:21 [0,15] 0 [0,15] 0 13985'53107 2015-03-29 21:17:15.568125
13985'49195 2015-03-24 18:38:08.244769
Then I stopped all OSDs, which blocked all I/O to the cluster, with:
stop ceph-osd-all
Then I looked for all copies of the PG on all OSDs with:
for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name "$i"
; done | sort -V
/var/lib/ceph/osd/ceph-0/current/3.c7_head
/var/lib/ceph/osd/ceph-0/current/3.102_head
/var/lib/ceph/osd/ceph-3/current/3.c7_head
/var/lib/ceph/osd/ceph-13/current/3.102_head
/var/lib/ceph/osd/ceph-15/current/3.c7_head
/var/lib/ceph/osd/ceph-15/current/3.102_head
Then I flushed the journals for all of those OSDs with:
for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done
Then I pulled all of those drives and moved them (using the journal-fixing
steps above) to Ithome, where I used ceph_objectstore_tool to remove all
traces of 3.102 and 3.c7:
for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool
--data-path /var/lib/ceph/osd/ceph-$i --journal-path
/var/lib/ceph/osd/ceph-$i/journal --op remove --pgid $j ; done ; done
Then I imported the PGs onto OSD.0 and OSD.15 with:
for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data-path
/var/lib/ceph/osd/ceph-$i --journal-path /var/lib/ceph/osd/ceph-$i/journal
--op import --file ~/${j}.export ; done ; done
for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm
/var/lib/ceph/osd/ceph-$i/journal ; done
Then I moved the disks back to Storage1 and started them all up again. I think
that should have worked, but in this case OSD.0 didn't start for some reason.
I initially thought that wouldn't matter, since OSD.15 did start and so we
should have had everything, but a ceph pg query of the PGs showed something
like:
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [0],
"peering_blocked_by": [{
"osd": 0,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}]
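For reference, that output comes from querying each stuck PG directly, e.g.:

```shell
# Dump peering state for one PG; the "recovery_state" section contains
# the "peering_blocked_by" entries shown above.
ceph pg 3.c7 query
```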
So I then removed OSD.0 from the cluster and everything came back to life.
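The removal was roughly the standard OSD-removal sequence (a sketch; the exact commands used weren't recorded in these notes):

```shell
# Tell peering that OSD.0 is permanently gone, then remove it from the
# CRUSH map, the auth database, and the OSD map.
ceph osd lost 0 --yes-i-really-mean-it
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0
```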
Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com