Well, I did something bad; I just don't know how bad yet.  Before we get
into it: my critical data is backed up to CrashPlan.  I'd rather not lose
all my archive data, but losing some of it is OK.

I added a bunch of disks to my Ceph cluster, so I shut the cluster down and
dd'd the raw disks around so that the disks and OSDs were ordered by their
IDs on the HBA.  I fat-fingered one disk and overwrote it.  Another disk
didn't dd correctly... it seems it wasn't unmounted cleanly, and smartctl
also reports some failures on it.  Running xfs_repair on that disk put a
whole bunch of data into lost+found.
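
For reference, the per-disk copies were just raw device-to-device dd's,
something along these lines (device names here are illustrative, not the
actual ones):

    # copy one OSD disk wholesale onto another so the OSD ids line up with
    # the HBA slot order (swapping if/of is exactly how I clobbered a disk)
    dd if=/dev/sdc of=/dev/sdf bs=4M conv=fsync status=progress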

I brought the cluster back up and let it settle.  The result is 49 stuck
PGs, and CephFS is halted.

ceph -s is here <https://pastebin.com/RXSinLjZ>
ceph osd tree is here <https://pastebin.com/qmE0dhyH>
ceph pg dump minus the active pg's is here <https://pastebin.com/36kpmA8s>

OSD-2 is gone with no chance to restore it.

OSD-3 is the one with the XFS corruption.  After xfs_repair I have a bunch
of files like
/var/lib/ceph/osd/ceph-3/lost+found/blah/DIR_[0-9]+/blah.blah__head_blah.blah
I looped these files through ceph osd map <pool> $file (loop sketched
below), and it looks like they have all been replicated to other OSDs, so
it seems safe to delete this data.
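
The loop was roughly this (pool name and the filename-to-object-name
translation are simplified; the real object names have some escaping in
them):

    pool="<pool>"   # substitute the actual pool name
    # for each recovered object file, strip the __head_... suffix to get an
    # approximate object name and ask Ceph where that object maps now
    for f in /var/lib/ceph/osd/ceph-3/lost+found/*/DIR_*/*__head_*; do
        obj=$(basename "$f" | sed 's/__head_.*$//')
        ceph osd map "$pool" "$obj"
    done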

There are also files named just [0-9]+ (which I assume are inode numbers
left by xfs_repair) in the top level of
/var/lib/ceph/osd/ceph-3/lost+found.  I don't know what to do with these
files.


I have a couple of questions:
1) Can the top-level lost+found files be used to recreate the stuck PGs?

2a) Can the stuck PGs be dropped and recreated to bring the cluster back to
a healthy state?
2b) If I do that, can CephFS be restored with only partial data loss?  The
CephFS documentation isn't quite clear on how to do this; my rough reading
of it is sketched below.
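
For 2a/2b, this is roughly the sequence I think the docs describe; I
haven't run any of it and may well have the commands or their order wrong:

    # 2a: recreate a lost PG as empty -- <pgid> is a placeholder
    # (I believe this is "ceph pg force_create_pg <pgid>" on older releases)
    ceph osd force-create-pg <pgid>

    # 2b: CephFS disaster recovery -- rebuild metadata from whatever objects
    # survive in the data pool (<data pool> is a placeholder)
    cephfs-journal-tool journal reset
    cephfs-table-tool all reset session
    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>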

Thanks for your time and help!
/Chris