Hello,
Too late I see, but still...
On Tue, 6 Sep 2016 22:17:05 -0400 Shain Miley wrote:
Hello,
It looks like we had 2 OSDs fail at some point earlier today; here is
the current status of the cluster:
You will really want to find out how and why that happened, because while
not impossible, two OSDs failing at once is pretty improbable.
Look for something like a HW fault (are the OSDs on the same host?), an
OOM event, etc.
root@rbd1:~# ceph -s
cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
health HEALTH_WARN
2 pgs backfill
5 pgs backfill_toofull
Bad, you will want your OSDs back in and then some.
Have a look at "ceph osd df".
69 pgs backfilling
74 pgs degraded
1 pgs down
1 pgs peering
Not good either.
Without bringing back your OSDs, that means doom for the data on those PGs.
74 pgs stuck degraded
1 pgs stuck inactive
75 pgs stuck unclean
74 pgs stuck undersized
74 pgs undersized
recovery 1903019/105270534 objects degraded (1.808%)
recovery 1120305/105270534 objects misplaced (1.064%)
crush map has legacy tunables
monmap e1: 3 mons at
{hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
285 TB used, 97367 GB / 380 TB avail
1903019/105270534 objects degraded (1.808%)
1120305/105270534 objects misplaced (1.064%)
3893 active+clean
69 active+undersized+degraded+remapped+backfilling
6 active+clean+scrubbing
3 active+undersized+degraded+remapped+backfill_toofull
2 active+clean+scrubbing+deep
When in recovery/backfill situations, you always want to stop any and all
scrubbing.
2 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
1 down+peering
recovery io 248 MB/s, 84 objects/s
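The advice above about stopping scrubs during recovery/backfill comes down to setting two cluster-wide flags (in-flight scrubs will still run to completion on their own):

```shell
# Pause all new scrubbing while recovery/backfill is in flight:
ceph osd set noscrub
ceph osd set nodeep-scrub

# Once the cluster is back to HEALTH_OK, re-enable scrubbing:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```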
We had been running for a while with 107 OSDs (not 108); it looks like
OSDs 64 and 76 are both now down and out at this point.
I have looked through the ceph logs for each OSD and did not see anything
obvious; the RAID controller also does not show the disks offline.
Get to the bottom of that, normally something gets logged when an OSD
fails.
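A rough sketch of where to look (assuming the default log locations; the grep patterns are only illustrative, adjust paths and OSD IDs to your deployment):

```shell
# Check the tail of each failed OSD's log for an assert or crash dump:
tail -n 200 /var/log/ceph/ceph-osd.64.log
tail -n 200 /var/log/ceph/ceph-osd.76.log

# An OOM kill shows up in the kernel log, not in Ceph's own logs:
dmesg | grep -i 'killed process'

# Kernel-level disk errors are worth ruling out as well:
dmesg | grep -iE 'sd[a-z].*error|i/o error'
```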
I am wondering if I should try to restart the two OSDs that are showing
as down...or should I wait until the current recovery is complete?
As said, try to restart immediately, just to keep the traffic down for
starters.
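For that Ceph generation the start command depends on the node's init system; one of these variants should apply (OSD IDs per the quoted status):

```shell
# systemd (e.g. Ubuntu 16.04 / CentOS 7):
systemctl start ceph-osd@64
systemctl start ceph-osd@76

# upstart (e.g. Ubuntu 14.04):
#   start ceph-osd id=64 ; start ceph-osd id=76
# sysvinit:
#   /etc/init.d/ceph start osd.64 ; /etc/init.d/ceph start osd.76

# Then watch them rejoin:
ceph osd tree
ceph -w
```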
The pool has a replica level of '2'...and with 2 failed disks I want to
do whatever I can to make sure there are no missing objects.
I sure hope that pool holds backups or something of that nature.
The only time a replica count of 2 isn't a cry for Murphy to smite you is
with RAID-backed OSDs or VERY well monitored and vetted SSDs.
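Once the cluster is healthy again, moving to 3 replicas is a per-pool setting ("mypool" below is a placeholder, substitute your pool's actual name):

```shell
# Raise the replica count; this triggers substantial backfill,
# so do it only after the current recovery has finished.
ceph osd pool set mypool size 3

# min_size=2 keeps the pool writable with one replica down:
ceph osd pool set mypool min_size 2
```

Note that going from size 2 to 3 will copy roughly another full replica's worth of data, so check "ceph osd df" for headroom first.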
Thanks in advance,
Shain
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com