> -----Original Message-----
> From: ceph-users [mailto:[email protected]] On Behalf Of
> Gregory Farnum
> Sent: 01 July 2015 16:56
> To: Daniel Schneller
> Cc: [email protected]
> Subject: Re: [ceph-users] Node reboot -- OSDs not "logging off" from cluster
>
> On Tue, Jun 30, 2015 at 10:36 AM, Daniel Schneller
> <[email protected]> wrote:
> > Hi!
> >
> > We are seeing a strange - and problematic - behavior in our 0.94.1
> > cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.
> >
> > When rebooting one of the nodes (e.g. for a kernel upgrade) the OSDs
> > do not seem to shut down correctly. Clients hang and ceph osd tree
> > shows the OSDs of that node still up. Repeated runs of ceph osd tree
> > show them going down after a while. For instance, here OSD.7 is still
> > up, even though the machine is in the middle of the reboot cycle.
> >
> > [C|root@control01] ~ ➜ ceph osd tree
> > # id weight type name up/down reweight
> > -1 36.2 root default
> > -2 7.24 host node01
> > 0 1.81 osd.0 up 1
> > 5 1.81 osd.5 up 1
> > 10 1.81 osd.10 up 1
> > 15 1.81 osd.15 up 1
> > -3 7.24 host node02
> > 1 1.81 osd.1 up 1
> > 6 1.81 osd.6 up 1
> > 11 1.81 osd.11 up 1
> > 16 1.81 osd.16 up 1
> > -4 7.24 host node03
> > 2 1.81 osd.2 down 1
> > 7 1.81 osd.7 up 1
> > 12 1.81 osd.12 down 1
> > 17 1.81 osd.17 down 1
> > -5 7.24 host node04
> > 3 1.81 osd.3 up 1
> > 8 1.81 osd.8 up 1
> > 13 1.81 osd.13 up 1
> > 18 1.81 osd.18 up 1
> > -6 7.24 host node05
> > 4 1.81 osd.4 up 1
> > 9 1.81 osd.9 up 1
> > 14 1.81 osd.14 up 1
> > 19 1.81 osd.19 up 1
> >
> > So it seems the services are either not shut down correctly when the
> > reboot begins, or they do not get enough time to actually let the
> > cluster know they are going away.
> >
> > If I stop the OSDs on that node manually before the reboot, everything
> > works as expected and clients don't notice any interruptions.
> >
> > [C|root@node03] ~ ➜ service ceph-osd stop id=2
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ service ceph-osd stop id=7
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ service ceph-osd stop id=12
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ service ceph-osd stop id=17
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ reboot
> >
> > The upstart file was not changed from the packaged version.
> > Interestingly, the same Ceph version on a different cluster does _not_
> > show this behaviour.
> >
> > Any ideas as to what is causing this or how to diagnose this?
Do you have the OSDs running on the same boxes as the monitors?
>
> I'm not sure why it would be happening, but:
> * The OSDs send out shutdown messages to the monitor indicating they're
> going away whenever they get shut down politely. There's a short timeout to
> make sure they don't hang on you.
> * The only way the OSD doesn't get marked down during reboot is if the
> monitor doesn't get this message.
> * If the monitor isn't getting the message, the OSD either isn't sending the
> message or it's getting blocked.
>
> My guess is that for some reason the OSDs are getting the shutdown signal
> after the networking goes away.
> -Greg
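
To narrow down whether the monitors ever see the OSDs go away, it may help to check from another node which OSDs of the rebooting host are still reported up. Below is a minimal sketch, not a tested tool. It assumes the `ceph osd tree` column layout shown in the output quoted above (id, weight, type/name, up/down, reweight), which may differ in other releases, and `osds_up_on_host` is just a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: read `ceph osd tree` output on stdin and print the
# ids of the OSDs that are listed as "up" under the given host.
# Assumes the column layout from the output quoted in this thread.
osds_up_on_host() {
    awk -v host="$1" '
        # A host line looks like: -4 7.24 host node03
        $3 == "host" { current = $4; next }
        # An OSD line looks like: 7 1.81 osd.7 up 1
        current == host && $3 ~ /^osd\.[0-9]+$/ && $4 == "up" { print $1 }
    '
}
```

Run from a node that stays up, for example as `ceph osd tree | osds_up_on_host node03`. For the tree Daniel posted this should print only 7. An empty result would mean the monitors have marked every OSD on that host down.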
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com