On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
<qhart...@direwolfdigital.com> wrote:
> I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last
> Friday I got everything deployed and all was working well, and I set noout
> and shut all the OSD nodes down over the weekend. Yesterday when I spun it
> back up, the OSDs were behaving very strangely, incorrectly marking each
> other down because of missed heartbeats, even though they were up. It looked
> like some kind of low-level networking problem, but I couldn't find one.
>
> After much work, I narrowed the apparent source of the problem down to the
> OSDs running on the first host I started in the morning. They were the ones
> that were logging the most messages about not being able to ping other OSDs,
> and the other OSDs were mostly complaining about them. After running out of
> other ideas to try, I restarted them, and then everything started working.
> It's still working happily this morning. It seems as though when that set of
> OSDs started they got stale OSD map information from the MON boxes, which
> failed to be updated as the other OSDs came up. Does that make sense? I
> still don't consider myself an expert on Ceph architecture and would
> appreciate any corrections or other possible interpretations of events (I'm
> happy to provide whatever additional information I can) so I can get a
> deeper understanding of things. If my interpretation of events is correct,
> it seems that might point at a bug.

I can't find the ticket now, but I think we did indeed have a bug
around heartbeat failures when restarting nodes. This has been fixed
in other branches but might have been missed for giant. (Did you by
any chance set the nodown flag as well as noout?)
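
For what it's worth, you can see exactly which flags are set from the
monitors (a quick sketch using the stock ceph CLI; adjust for your own
admin keyring/config):

    ceph -s                      # health output lists flags like noout/nodown
    ceph osd dump | grep flags   # the "flags" line of the current OSD map

and clear anything left over with "ceph osd unset noout" (and likewise
for nodown).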

In general Ceph isn't very happy about being shut down completely like
that, and its behavior in that scenario isn't well validated. Nothing
will go seriously wrong, but you may run into little irritants like
this. It's particularly likely when you're prohibiting state changes
with the noout/nodown flags.
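
If you do need to take the whole cluster down again, the usual pattern
is just to wrap the maintenance window in noout, roughly (a sketch,
assuming default admin credentials):

    ceph osd set noout      # keep down OSDs from being marked out and rebalanced
    # ... shut down the OSD nodes, do the work, bring everything back up ...
    ceph osd unset noout    # resume normal down/out handling

noout alone is usually enough for a planned shutdown like this.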
-Greg
