Hi

I am running a small cluster of 8 machines (80 OSDs) with three monitors,
on Ubuntu 16.04 with Ceph 10.2.5.

I cannot reboot the monitors without physically going into the datacenter
and power cycling them. During shutdown, Ceph appears to get stuck trying
to contact the monitors after networking has already been shut down, or
something like that. I get an endless stream of:

libceph: connect 10.20.0.10:6789 error -101
libceph: connect 10.20.0.13:6789 error -101
libceph: connect 10.20.0.17:6789 error -101

where in this case 10.20.0.10 is the machine I am trying to shut down and
all three IPs are the MONs.
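
For reference, errno 101 on Linux is ENETUNREACH ("Network is unreachable"), which fits the guess that networking is already down by then. Below is a minimal Python sketch for decoding lines like the ones above. It assumes the exact log format shown; the function name and the small errno table are only illustrative, not part of any Ceph tooling.

```python
import re

# Linux errno values (assumption: a Linux kernel, as on Ubuntu 16.04).
# Only a few network-related codes are listed here for illustration.
LINUX_ERRNO = {
    101: ("ENETUNREACH", "Network is unreachable"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches e.g. "libceph: connect 10.20.0.10:6789 error -101"
LINE_RE = re.compile(
    r"libceph: connect (\d+\.\d+\.\d+\.\d+):(\d+) error (-?\d+)"
)


def parse_libceph_connect_error(line):
    """Return a dict describing a libceph connect error line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # The kernel logs the error as a negative errno; take the magnitude.
    err = abs(int(match.group(3)))
    name, text = LINUX_ERRNO.get(err, ("UNKNOWN", "not in this table"))
    return {
        "addr": match.group(1),
        "port": int(match.group(2)),
        "errno": err,
        "name": name,
        "text": text,
    }


if __name__ == "__main__":
    sample = "libceph: connect 10.20.0.10:6789 error -101"
    print(parse_libceph_connect_error(sample))
```

Other error codes would need to be added to the table by hand; anything not listed is reported as UNKNOWN rather than guessed.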

At this stage of the shutdown, the machine doesn't respond to pings, and I
cannot even log in on any of the virtual terminals. The only option left is
to power off the server physically.

The other non-mon servers shut down just fine, and the cluster was healthy
at the time I was rebooting the mon (I only reboot one machine at a time,
waiting for it to come up before I do the next one).

Also worth mentioning that if I execute

sudo systemctl stop ceph\*.service ceph\*.target

on the server, the only Ceph-related entries left in the process list are:

root     11143     2  0 18:40 ?        00:00:00 [ceph-msgr]
root     11162     2  0 18:40 ?        00:00:00 [ceph-watch-noti]

and even then, when no ceph daemons are left running, doing a reboot goes
into the same loop.

I can't really find any mention of this online, but I feel someone must
have hit this. Any idea how to fix it? It's really annoying because it's
hard for me to get access to the datacenter.

Thanks
Michael