Hi,
I would be extremely grateful for any assistance in resolving our situation
and would be happy to pay consultation/support fees:
[admin@kvm5b ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 168/4633062 objects misplaced
(0.004%); 1/1398478 objects unfound (0.000%); Reduced data availability: 2 pgs
inactive, 2 pgs down; Degraded data redundancy: 339/4633062 objects degraded
(0.007%), 3 pgs unclean, 1 pg degraded, 1 pg undersized
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
OBJECT_MISPLACED 168/4633062 objects misplaced (0.004%)
OBJECT_UNFOUND 1/1398478 objects unfound (0.000%)
pg 4.43 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs down
pg 7.4 is down+remapped, acting
[2147483647,2147483647,31,2147483647,32,2147483647]
pg 7.f is down+remapped, acting
[2147483647,2147483647,32,2147483647,30,2147483647]
PG_DEGRADED Degraded data redundancy: 339/4633062 objects degraded (0.007%), 3
pgs unclean, 1 pg degraded, 1 pg undersized
pg 4.43 is stuck undersized for 4933.586411, current state
active+recovery_wait+undersized+degraded+remapped, last acting [27]
pg 7.4 is stuck unclean for 30429.018746, current state down+remapped, last
acting [2147483647,2147483647,31,2147483647,32,2147483647]
pg 7.f is stuck unclean for 30429.010752, current state down+remapped, last
acting [2147483647,2147483647,32,2147483647,30,2147483647]
(2147483647 is CRUSH's "none" placeholder, i.e. no OSD could be mapped to
that slot.)
We've happily been running a 6 node cluster with 4 x FileStore HDDs per node
(journals on SSD partitions) for over a year and recently upgraded all nodes to
Debian 9, Ceph Luminous 12.2.2 and kernel 4.13.8. We ordered 12 x Intel DC
S4600 SSDs which arrived last week, so we added two per node on Thursday
evening and brought them up as BlueStore OSDs. We had proactively updated our
existing pools' CRUSH rules to reference only devices classed as 'hdd', so
that we could move select images over to SSD-backed replicated and erasure
coded pools.
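For reference, restricting a pool to one device class in Luminous is done with
class-aware CRUSH rules; a minimal sketch (the rule name and the pool it is
applied to are hypothetical):

```shell
# Create a replicated CRUSH rule that only selects OSDs classed 'hdd':
ceph osd crush rule create-replicated replicated_hdd default host hdd
# Point an existing pool at the new rule (pool name hypothetical):
ceph osd pool set rbd crush_rule replicated_hdd
```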
We were pretty diligent and downloaded Intel's Firmware Update Tool and
validated that each new drive had the latest available firmware before
installing them in the nodes. We did numerous benchmarks on Friday and
eventually moved some images over to the new storage pools. Everything was
working perfectly and extensive tests on Sunday showed excellent performance.
Sunday night one of the new SSDs died and Ceph replicated and redistributed
data accordingly, then another failed in the early hours of Monday morning and
Ceph did what it needed to.
We had the two failed drives replaced by 11am and Ceph was up to 2/4918587
objects degraded (0.000%) when a third drive failed. At this point we updated
the CRUSH rules for the rbd_ssd and ec_ssd pools to use the 'hdd' device
class, to essentially evacuate everything off the SSDs. Other SSDs then failed
at 3:22pm, 4:19pm, 5:49pm and 5:50pm. We've ultimately lost half the Intel
S4600 drives, which are all completely inaccessible. Our status at 11:42pm
Monday night was: 1/1398478 objects unfound (0.000%) and 339/4633062 objects
degraded (0.007%).
We're essentially looking for assistance with:
- Copying images from the damaged pools (attempting to access these currently
results in requests that never time out)
- Advice on how to later, if Intel can unlock the failed SSDs, import the
missing object shards to gain full access to the images
There are two pools that were configured to use devices with a 'ssd' device
class, namely a replicated pool called 'rbd_ssd' (size 3) and an erasure coded
pool 'ec_ssd' (k=4 and m=2). One 80 GB image lived directly in the rbd_ssd
pool, while 5 images totalling 1.4 TB had their metadata in the rbd_ssd pool
but their data in the ec_ssd pool.
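Such split placement is typically set up in Luminous with the image metadata in
the replicated pool and the data objects in the EC pool via --data-pool; a
sketch with a hypothetical image name and size:

```shell
# Image header/metadata live in rbd_ssd; data objects go to ec_ssd:
rbd create rbd_ssd/vmdisk1 --size 200G --data-pool ec_ssd
```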
What we've done so far:
- Run Intel diagnostic tools on the failed SSDs, which report 'logically
locked, contact Intel support'. This gives us some hope that Intel, should
the case finally land with someone who speaks English, can somehow unlock the
drives.
- Stopped the rest of the Intel SSD OSDs, peeked at the content of the
BlueStore volumes (ceph-objectstore-tool --op fuse --data-path
/var/lib/ceph/osd/ceph-34 --mountpoint /mnt), unmounted them again and then
exported the affected placement groups, eg:
ceph-objectstore-tool --op export --pgid 7.fs1 --data-path
/var/lib/ceph/osd/ceph-34 --file /exported_data/osd34_7.fs1.export
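The per-PG exports can be scripted; a minimal sketch, assuming the OSD service
is stopped and the same paths as above (the /exported_data directory is ours):

```shell
#!/bin/sh
# Export every PG held by a stopped OSD (adjust the OSD id and the
# output directory for your layout).
OSD=34
DATA=/var/lib/ceph/osd/ceph-$OSD
OUT=/exported_data

# --op list-pgs prints one PG id per line for this data path.
for PG in $(ceph-objectstore-tool --op list-pgs --data-path "$DATA"); do
    ceph-objectstore-tool --op export --pgid "$PG" \
        --data-path "$DATA" --file "$OUT/osd${OSD}_${PG}.export"
done
```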
- Tried 'ceph osd force-create-pg X', 'ceph pg deep-scrub X', 'ceph osd lost
$ID --yes-i-really-mean-it'
We originally deleted the failed OSDs, but 'ceph pg 4.43 mark_unfound_lost
delete' yielded 'Error EINVAL: pg has 1 unfound objects but we haven't probed
all sources, not marking lost'. We therefore recreated all the previous OSDs
in partitions of a different SSD and, with the OSD services stopped, imported
the previous exports.
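For reference, the import side looks roughly like this (same tool, OSD service
stopped; the file name is from the export step above):

```shell
# Import a previously exported PG into the recreated OSD while its
# service is stopped, then start it so the PG can peer again.
OSD=34
ceph-objectstore-tool --op import \
    --data-path /var/lib/ceph/osd/ceph-$OSD \
    --file /exported_data/osd34_7.fs1.export
systemctl start ceph-osd@$OSD
```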
Status now:
[admin@kvm5b ~]# ceph health detail
HEALTH_WARN noout flag(s) set; Reduced data availability: 3 pgs inactive, 3 pgs
down; Degraded data redundancy: 3 pgs unclean
OSDMAP_FLAGS noout flag(s) set
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 3 pgs down
pg 4.43 is down, acting [4,15,18]
pg 7.4 is down, acting [8,5,21,18,15,0]
pg 7.f is down, acting [23,0,16,5,11,14]
PG_DEGRADED Degraded data redundancy: 3 pgs unclean
pg 4.43 is stuck unclean since forever, current state down, last acting
[4,15,18]
pg 7.4 is stuck unclean since forever, current state down, last acting
[8,5,21,18,15,0]
pg 7.f is stuck unclean since forever, current state down, last acting
[23,0,16,5,11,14]
Original 'ceph pg X query' status (before we mucked around by exporting and
deleting OSDs): https://pastebin.com/fBQhq6UQ
Current 'ceph pg X query' status (after recreating temporary OSDs with the
original IDs and importing the exports): https://pastebin.com/qcN5uYkN
What we assume needs to be done:
- Tell Ceph that the OSDs are lost (query status in the pastebin above reports
'starting or marking this osd lost may let us proceed'). We have, however,
already stopped the temporary OSDs, marked them as out and run
'ceph osd lost $ID --yes-i-really-mean-it'.
- Somehow get Ceph to forget about the sharded objects it doesn't have
sufficient pieces of.
- Copy the images to another pool so that we can get pieces of data off these
and rebuild those systems.
- Hopefully get Intel to unlock the drives, export as much of the content as
possible and import the various exports so that we can ultimately copy off
complete images.
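In command form we assume the first three steps would look something like the
following (the OSD id, PG id and target pool/image names are placeholders;
'osd lost' and mark_unfound_lost are destructive, so this is a sketch rather
than a recommendation):

```shell
# 1. Declare each dead OSD lost so peering can proceed:
ceph osd lost 34 --yes-i-really-mean-it     # repeat per failed OSD id

# 2. Give up on objects that have no remaining source:
ceph pg 4.43 mark_unfound_lost delete

# 3. Once the PGs go active, copy each image to a healthy pool
#    ('rescue' pool and image names are hypothetical):
rbd copy rbd_ssd/vm-disk rescue/vm-disk
```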
Really, really hoping to have a Merry Christmas... ;)
PS: We got the 80 GB image out: it had a single 4 MB object hole, so we used
ddrescue to read the source image forwards, rebooted the node when it stalled
on the missing data, and then repeated the copy in the reverse direction...
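Roughly what that recovery looked like, with device and image names as
placeholders (ddrescue's map file lets the reverse pass skip what the forward
pass already recovered):

```shell
rbd map rbd_ssd/image80g                  # hypothetical name; yields e.g. /dev/rbd0
ddrescue /dev/rbd0 image.img image.map    # forward pass; stalls at the hole
# ...reboot the stalled node, then retry the remainder in reverse:
ddrescue -R /dev/rbd0 image.img image.map
```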
Regards
David Herselman
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com