Hi,
I would be extremely grateful for any assistance in resolving our situation
and would be happy to pay consultation/support fees:
[admin@kvm5b ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 168/4633062 objects misplaced
(0.004%); 1/1398478 objects unfound (0.000%); Reduced data availability: 2 pgs
inactive, 2 pgs down; Degraded data redundancy: 339/4633062 objects degraded
(0.007%), 3 pgs unclean, 1 pg degraded, 1 pg undersized
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
OBJECT_MISPLACED 168/4633062 objects misplaced (0.004%)
OBJECT_UNFOUND 1/1398478 objects unfound (0.000%)
pg 4.43 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs down
pg 7.4 is down+remapped, acting
[2147483647,2147483647,31,2147483647,32,2147483647]
pg 7.f is down+remapped, acting
[2147483647,2147483647,32,2147483647,30,2147483647]
PG_DEGRADED Degraded data redundancy: 339/4633062 objects degraded (0.007%), 3
pgs unclean, 1 pg degraded, 1 pg undersized
pg 4.43 is stuck undersized for 4933.586411, current state
active+recovery_wait+undersized+degraded+remapped, last acting [27]
pg 7.4 is stuck unclean for 30429.018746, current state down+remapped, last
acting [2147483647,2147483647,31,2147483647,32,2147483647]
pg 7.f is stuck unclean for 30429.010752, current state down+remapped, last
acting [2147483647,2147483647,32,2147483647,30,2147483647]
(2147483647 is CRUSH's "none" placeholder, i.e. no OSD could be mapped to
that slot.)
We've happily been running a 6 node cluster with 4 x FileStore HDDs per node
(journals on SSD partitions) for over a year and recently upgraded all nodes to
Debian 9, Ceph Luminous 12.2.2 and kernel 4.13.8. We ordered 12 x Intel DC
S4600 SSDs which arrived last week, so we added two per node on Thursday
evening and brought them up as BlueStore OSDs. We had proactively updated our
existing pools' CRUSH rules to reference only devices classed as 'hdd', so
that we could move select images over to SSD-backed replicated and erasure
coded pools.
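For reference, restricting a pool to one device class in Luminous is done with
class-aware CRUSH rules; a minimal sketch (the rule name and the pool it is
applied to are hypothetical):

```shell
# Create a replicated CRUSH rule that only selects OSDs classed 'hdd':
ceph osd crush rule create-replicated replicated_hdd default host hdd
# Point an existing pool at the new rule (pool name hypothetical):
ceph osd pool set rbd crush_rule replicated_hdd
```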
We were pretty diligent and downloaded Intel's Firmware Update Tool and
validated that each new drive had the latest available firmware before
installing them in the nodes. We did numerous benchmarks on Friday and
eventually moved some images over to the new storage pools. Everything was
working perfectly and extensive tests on Sunday showed excellent performance.
Sunday night one of the new SSDs died and Ceph replicated and redistributed
data accordingly, then another failed in the early hours of Monday morning and
Ceph did what it needed to.
We had the two failed drives replaced by 11am and Ceph was up to 2/4918587
objects degraded (0.000%) when a third drive failed. At this point we updated
the CRUSH rules for the rbd_ssd and ec_ssd pools to use the 'hdd' device
class, to essentially evacuate everything off the SSDs. Other SSDs then failed
at 3:22pm, 4:19pm, 5:49pm and 5:50pm. We've ultimately lost half the Intel
S4600 drives, which are all completely inaccessible. Our status at 11:42pm
Monday night was: 1/1398478 objects unfound (0.000%) and 339/4633062 objects
degraded (0.007%).
We're essentially looking for assistance with:
- Copying images from the damaged pools (attempting to access these currently
results in requests that never time out)
- Advice on how to later, if Intel can unlock the failed SSDs, import the
missing object shards to gain full access to the images
There are two pools that were configured to use devices with a 'ssd' device
class, namely a replicated pool called 'rbd_ssd' (size 3) and an erasure coded
pool 'ec_ssd' (k=4 and m=2). One 80 GB image lived directly in the rbd_ssd
pool, while 5 images totalling 1.4 TB had their metadata in the rbd_ssd pool
but their data in the ec_ssd pool.
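Such split placement is typically set up in Luminous with the image metadata in
the replicated pool and the data objects in the EC pool via --data-pool; a
sketch with a hypothetical image name and size:

```shell
# Image header/metadata live in rbd_ssd; data objects go to ec_ssd:
rbd create rbd_ssd/vmdisk1 --size 200G --data-pool ec_ssd
```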
What we've done so far:
- Run Intel diagnostic tools on the failed SSDs, which report 'logically
locked, contact Intel support'. This gives us some hope that Intel, should
the case finally land with someone who speaks English, can somehow unlock the
drives.
- Stopped the rest of the Intel SSD OSDs, peeked at the content of the
BlueStore volumes (ceph-objectstore-tool --op fuse --data-path
/var/lib/ceph/osd/ceph-34 --mountpoint /mnt), unmounted them again and then
exported the affected placement groups, eg:
ceph-objectstore-tool --op export --pgid 7.fs1 --data-path
/var/lib/ceph/osd/ceph-34 --file /exported_data/osd34_7.fs1.export
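The per-PG exports can be scripted; a minimal sketch, assuming the OSD service
is stopped and the same paths as above (the /exported_data directory is ours):

```shell
#!/bin/sh
# Export every PG held by a stopped OSD (adjust the OSD id and the
# output directory for your layout).
OSD=34
DATA=/var/lib/ceph/osd/ceph-$OSD
OUT=/exported_data

# --op list-pgs prints one PG id per line for this data path.
for PG in $(ceph-objectstore-tool --op list-pgs --data-path "$DATA"); do
    ceph-objectstore-tool --op export --pgid "$PG" \
        --data-path "$DATA" --file "$OUT/osd${OSD}_${PG}.export"
done
```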
- Tried 'ceph osd force-create-pg X', 'ceph pg deep-scrub X', 'ceph osd lost
$ID --yes-i-really-mean-it'
We originally deleted the failed OSDs, but 'ceph pg 4.43 mark_unfound_lost
delete' yielded 'Error EINVAL: pg has 1 unfound objects but we haven't probed
all sources, not marking lost'. We therefore recreated all the previous OSDs
in partitions of a different SSD and, with the OSD services stopped, imported
the previous exports.
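For reference, the import side looks roughly like this (same tool, OSD service
stopped; the file name is from the export step above):

```shell
# Import a previously exported PG into the recreated OSD while its
# service is stopped, then start it so the PG can peer again.
OSD=34
ceph-objectstore-tool --op import \
    --data-path /var/lib/ceph/osd/ceph-$OSD \
    --file /exported_data/osd34_7.fs1.export
systemctl start ceph-osd@$OSD
```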
Status now:
[admin@kvm5b ~]# ceph health detail
HEALTH_WARN noout flag(s) set; Reduced data availability: 3 pgs inactive, 3 pgs
down; Degraded data redundancy: 3 pgs unclean
OSDMAP_FLAGS noout flag(s) set
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 3 pgs down
pg 4.43 is down, acting [4,15,18]
pg 7.4 is down, acting [8,5,21,18,15,0]
pg 7.f is down, acting [23,0,16,5,11,14]
PG_DEGRADED Degraded data redundancy: 3 pgs unclean
pg 4.43 is stuck unclean since forever, current state down, last acting
[4,15,18]
pg 7.4 is stuck unclean since forever, current state down, last acting
[8,5,21,18,15,0]
pg 7.f is stuck unclean since forever, current state down, last acting
[23,0,16,5,11,14]
Original 'ceph pg X query' status (before we mucked around by exporting and
deleting OSDs): https://pastebin.com/fBQhq6UQ
Current 'ceph pg X query' status (after recreating temporary OSDs with the
original IDs and importing the exports): https://pastebin.com/qcN5uYkN
What we assume needs to be done:
- Tell Ceph that the OSDs are lost (query status in the pastebin above reports
'starting or marking this osd lost may let us proceed'). We have, however,
already stopped the temporary OSDs, marked them as out and run
'ceph osd lost $ID --yes-i-really-mean-it'.
- Somehow get Ceph to forget about the sharded objects it doesn't have
sufficient pieces of.
- Copy the images to another pool so that we can get pieces of data off these
and rebuild those systems.
- Hopefully get Intel to unlock the drives, export as much of the content as
possible and import the various exports so that we can ultimately copy off
complete images.
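In command form we assume the first three steps would look something like the
following (the OSD id, PG id and target pool/image names are placeholders;
'osd lost' and mark_unfound_lost are destructive, so this is a sketch rather
than a recommendation):

```shell
# 1. Declare each dead OSD lost so peering can proceed:
ceph osd lost 34 --yes-i-really-mean-it     # repeat per failed OSD id

# 2. Give up on objects that have no remaining source:
ceph pg 4.43 mark_unfound_lost delete

# 3. Once the PGs go active, copy each image to a healthy pool
#    ('rescue' pool and image names are hypothetical):
rbd copy rbd_ssd/vm-disk rescue/vm-disk
```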
Really, really hoping to have a Merry Christmas... ;)
PS: We got the 80 GB image out: it had a single 4 MB object hole, so we used
ddrescue to read the source image forwards, rebooted the node when it stalled
on the missing data, and then repeated the copy in the reverse direction...
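Roughly what that recovery looked like, with device and image names as
placeholders (ddrescue's map file lets the reverse pass skip what the forward
pass already recovered):

```shell
rbd map rbd_ssd/image80g                  # hypothetical name; yields e.g. /dev/rbd0
ddrescue /dev/rbd0 image.img image.map    # forward pass; stalls at the hole
# ...reboot the stalled node, then retry the remainder in reverse:
ddrescue -R /dev/rbd0 image.img image.map
```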
Regards
David Herselman
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com