[ceph-users] cluster unavailable for 20 mins when downed server was reintroduced

Sean Purdy Tue, 15 Aug 2017 04:23:40 -0700

Luminous 12.1.1 rc1

Hi,



I have a three-node cluster with 6 OSDs and 1 mon per node.

I had to turn off one node for rack reasons.  While the node was down, the 
cluster was still running and accepting files via radosgw.  However, when I 
turned the machine back on, radosgw uploads stopped working and things like 
"ceph status" started timing out.  It took 20 minutes for "ceph status" to be 
OK.

In the recent past I've rebooted one or other node and the cluster kept 
working, and when the machine came back, the OSDs and monitor rejoined the 
cluster and things went on as usual.

The machine was off for 21 hours or so.

Any idea what might be happening, and how to mitigate the effects of this next 
time a machine has to be down for any length of time?


"ceph status" said:

2017-08-15 11:28:29.835943 7fdf2d74b700  0 monclient(hunting): authenticate timed out after 300
2017-08-15 11:28:29.835993 7fdf2d74b700  0 librados: client.admin authentication error (110) Connection timed out


monitor log said things like this before everything came together:

2017-08-15 11:23:07.180123 7f11c0fcc700  0 -- 172.16.0.43:0/2471 >> 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER

but "ceph --admin-daemon /var/run/ceph/ceph-mon.xxx.asok quorum_status" did 
work.  This monitor node was detected but not yet in quorum.
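
For anyone wanting to script that check while "ceph status" is timing out, here is a minimal sketch.  It assumes the admin-socket "quorum_status" output is JSON containing a "quorum_names" list (worth verifying on your release); the function names and the socket path passed on the command line are hypothetical, not anything from the cluster above.

```python
import json
import subprocess
import sys


def in_quorum(status_json: str, mon_name: str) -> bool:
    """Return True if mon_name is listed under "quorum_names".

    Assumption: quorum_status prints a JSON object with a
    "quorum_names" list; a missing key is treated as "not in quorum".
    """
    status = json.loads(status_json)
    return mon_name in status.get("quorum_names", [])


def query_quorum_status(asok_path: str) -> str:
    """Run the admin-socket command quoted above and return its raw output."""
    result = subprocess.run(
        ["ceph", "--admin-daemon", asok_path, "quorum_status"],
        check=True,
        capture_output=True,  # requires Python 3.7+
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage (hypothetical script name): check_quorum.py <path-to-asok> <mon-name>
    asok, name = sys.argv[1], sys.argv[2]
    state = "in quorum" if in_quorum(query_quorum_status(asok), name) else "not in quorum"
    print(f"{name}: {state}")
```

Because this talks to the local monitor's admin socket rather than to the cluster, it should keep working in the situation described here, where the monitor is up but has not yet rejoined quorum.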


OSDs had 15 minutes of

ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No such file or directory

before becoming available.


Advice welcome.

Thanks,

Sean Purdy
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
