Re: [ceph-users] Ceph OSDs are down and cannot be started

Somnath Roy Tue, 07 Jul 2015 10:49:37 -0700

Run 'ceph-osd -i 0 -f' in a console and see what the output is.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:[email protected]] On Behalf Of Fredy 
Neeser
Sent: Tuesday, July 07, 2015 9:15 AM
To: [email protected]
Subject: [ceph-users] Ceph OSDs are down and cannot be started

Hi,

I had a working Ceph Hammer test setup with 3 OSDs and 1 MON (running on VMs), 
and RBD was working fine.

The setup was not touched for two weeks (also no I/O activity), and when I 
looked again, the cluster was in a bad state:

On the MON node (sto-vm20):
$ ceph health
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down

$ ceph health detail
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down
pg 0.22 is stuck stale for 1457679.263525, current state stale+active+clean, last acting [2,1,0]
pg 0.21 is stuck stale for 1457679.263529, current state stale+active+clean, last acting [1,2,0]
pg 0.20 is stuck stale for 1457679.263531, current state stale+active+clean, last acting [1,0,2]
pg 0.1f is stuck stale for 1457679.263533, current state stale+active+clean, last acting [2,0,1]
...
pg 0.24 is stuck stale for 1457679.263625, current state stale+active+clean, last acting [2,0,1]
pg 0.23 is stuck stale for 1457679.263627, current state stale+active+clean, last acting [1,2,0]
osd.0 is down since epoch 16, last address 9.4.68.111:6800/1658
osd.1 is down since epoch 16, last address 9.4.68.112:6800/1659
osd.2 is down since epoch 16, last address 9.4.68.113:6800/1654
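
For a larger cluster it can help to pull just the down OSD ids out of that output. The following is a minimal sketch, not part of the original report: the function name down_osds is hypothetical, and it assumes the "osd.N is down since epoch ..." line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the ids of OSDs reported as down, one per line.
# Reads `ceph health detail` text on stdin and assumes lines of the form
# "osd.N is down since epoch ...", as in the output above.
down_osds() {
    sed -n 's/^osd\.\([0-9][0-9]*\) is down since epoch.*/\1/p'
}

# Example (on the MON node):
#   ceph health detail | down_osds
```

Each id printed this way could then be passed to 'ceph-osd -i <id> -f' on the node that hosts that OSD.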

On the OSD nodes (sto-vm21, sto-vm22, sto-vm23), no Ceph daemon is running:
$ ps -ef | egrep "ceph|osd|rados"
(returns nothing)

I rebooted the OSDs as well as the MON, but still only the ceph-mon daemon is 
running on the MON node.

I tried to start the OSDs manually by executing $ sudo /etc/init.d/ceph start 
osd on the OSD nodes, but I saw neither an error message nor a log file update.

On the OSD nodes, the log files in /var/log/ceph have not been updated since 
the failure event.

What is strange is that the OSDs no longer have any admin socket files (which 
should normally be in /run/ceph), whereas the MON node does have an admin 
socket:
$ ls -la /run/ceph
srwxr-xr-x  1 root root   0 Jul  7 15:27 ceph-mon.sto-vm20.asok
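
The same check can be repeated on each node with a small helper. This is only a sketch under assumptions: the function name list_admin_sockets is hypothetical, and it assumes admin sockets end in .asok and live in /run/ceph unless another directory is given.

```shell
#!/bin/sh
# Hypothetical helper: print the admin socket files found in a directory
# (default /run/ceph) and return non-zero if there are none.
list_admin_sockets() {
    dir=${1:-/run/ceph}
    found=1
    for f in "$dir"/*.asok; do
        # -e accepts any existing file; use -S instead to insist on a
        # real UNIX socket. An unmatched glob stays literal and fails -e.
        if [ -e "$f" ]; then
            basename "$f"
            found=0
        fi
    done
    return "$found"
}

# Example (on an OSD node):
#   list_admin_sockets || echo "no admin sockets in /run/ceph"
```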

This looks very similar to
http://tracker.ceph.com/issues/7188
Bug #7188: Admin socket files are lost on log rotation calling initctl reload 
(ubuntu 13.04 only)

Any ideas on how to restart or recover the OSDs would be much appreciated.
How can I start the OSD daemon(s) such that I can see any errors?

Thanks,
- Fredy

PS: The Ceph setup is on Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-41-generic x86_64)

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
