Yesterday I noticed some OSDs were missing from our cluster: of 96 OSDs total, 
only 84 showed as up and in (84 up / 84 in).

After drilling down to find which node was affected and why, I found that all 
the OSDs on that node (12 total) were in fact down.

I ran 'systemctl status ceph-osd@$osd_number' to see exactly why they were 
down, and got:
Fail to open '/proc/0/cmdline' error = (2) No such file or directory
received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
osd.72 1067 *** Got signal Interrupt ***
osd.72 1067 shutdown

This happened on all twelve OSDs (osd.72 through osd.83).  Four went down the 
previous evening around 9pm EST, and the other eight went down at roughly 2am 
EST on the morning I discovered the issue (around 9am EST).
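
In case it helps anyone compare across OSDs, here is a minimal sketch of how 
the log text could be scanned for the same shutdown pattern. It assumes every 
affected OSD logs a line shaped like the "Got signal" line in the excerpt 
above; find_signals and SIGNAL_RE are made-up names for illustration, not part 
of any Ceph tooling.

```python
import re

# Matches lines such as "osd.72 1067 *** Got signal Interrupt ***".
# The second numeric field is matched as it appears in the excerpt;
# nothing is assumed about what it means.
SIGNAL_RE = re.compile(r"osd\.(\d+) \d+ \*\*\* Got signal (\w+) \*\*\*")


def find_signals(log_text):
    """Return (osd_id, signal_name) pairs for each 'Got signal' line found."""
    return [(int(m.group(1)), m.group(2)) for m in SIGNAL_RE.finditer(log_text)]


# Sample input copied from the status output quoted above.
sample = """\
Fail to open '/proc/0/cmdline' error = (2) No such file or directory
received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
osd.72 1067 *** Got signal Interrupt ***
osd.72 1067 shutdown
"""

print(find_signals(sample))  # prints [(72, 'Interrupt')]
```

The input could come from, for example, 'journalctl -u ceph-osd@72', assuming 
the OSDs log to the systemd journal on that node.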

Has anyone come across something like this, or know of a fix?  It hasn't 
happened since, but as this is a newly built-out cluster, it was a bit 
concerning.

Thanks in advance.