Re: [Linux-HA] Node remains offline after host restart

Emmanuel Saint-Joanis Fri, 26 Oct 2012 01:49:00 -0700

It looks as if (CRMd/pEngine) reasons: "I didn't manage to shoot the failing
node, so I (kind of) blacklist it as soon as I regain control of it."
Have you first tested extensively that your config works with
stonith-enabled="false"?
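
For the two checks quoted further down (the corosync bind address and the
"not in our membership" warning), here is a minimal sketch of how they could
be automated. It is only an illustration: the function names are hypothetical,
and the regular expression assumes that corosync-cmapctl prints member lines
of the form "runtime.totem.pg.mrp.srp.members.<nodeid>.ip (str) = r(0) ip(<address>)",
which may differ between corosync versions.

```python
import re

# Assumption: corosync-cmapctl member lines look like
#   runtime.totem.pg.mrp.srp.members.<nodeid>.ip (str) = r(0) ip(10.0.0.1)
MEMBER_IP = re.compile(r"members\.(\d+)\.ip\b.*?ip\(([^)]+)\)")


def loopback_members(cmapctl_output):
    """Return the node IDs whose member address is a 127.x.x.x loopback address."""
    bad = []
    for line in cmapctl_output.splitlines():
        match = MEMBER_IP.search(line)
        if match and match.group(2).startswith("127."):
            bad.append(match.group(1))
    return bad


def membership_warnings(log_text):
    """Return log lines where the CIB discards a message from a non-member."""
    return [
        line
        for line in log_text.splitlines()
        if "cib_peer_callback" in line and "not in our membership" in line
    ]
```

Feed loopback_members() the output of "corosync-cmapctl | grep member" and
membership_warnings() the contents of /var/log/messages; a non-empty result
from either one points at the problems described in the quoted mail.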



2012/10/26 James Guthrie <[email protected]>

> Hi Emmanuel,
>
> corosync is bound to the correct interface on both hosts.
>
> I looked for that line in the logs, but it didn't appear.
>
> My previous e-mail addressed to Ulrich contains logfiles and a broad
> explanation of the process that those logfiles capture.
>
> Regards,
> James
>
> On 10/25/2012 06:34 PM, Emmanuel Saint-Joanis wrote:
> > Looks like a common timeout issue while the network is coming up.
> >
> > Check whether corosync is bound to 127.0.0.1 instead of the real
> > interface with:
> > corosync-cmapctl | grep member
> >
> > Also check that no line like this appears in /var/log/messages:
> > WARN: cib_peer_callback: Discarding cib_apply_diff message (322) from
> > server2: not in our membership
> >
> > Send the logs via any web service such as pastebin.com <http://pastebin.com>.
> >
> > 2012/10/25 James Guthrie <[email protected]>
> >
> >     Hi all,
> >
> >     I've been battling with this problem for a few hours now, and I've
> >     gone over the obvious errors it could have been with the guys in the
> >     linux-ha IRC. I'd really like some help in trying to solve this
> >     problem.
> >
> >     I have a two-node corosync/pacemaker cluster (corosync: 2.0.1,
> >     pacemaker: 1.1.8). I can get the cluster to work fine, but I can also
> >     very easily get the cluster into a state from which it seems unable
> >     to recover. All I have to do is reboot the host of one of the cluster
> >     nodes. When I do so, any resources that were running on it are
> >     transferred to the second host. When the host comes back up, though,
> >     it appears as OFFLINE in the crm_mon of both cluster nodes.
> >
> >     Regardless of what I do on the "offline" host, nothing gets better.
> >     If, however, I stop and restart corosync/pacemaker on the other,
> >     "online" host, then everything seems to work again.
> >
> >     I tried waiting a while with one node offline. After a while the
> >     online node went offline, stating that the other node was now
> >     offline. For a few minutes the output of crm_mon was different on
> >     both hosts (both thought the other was online and they themselves
> >     were offline). Then finally it settled in the exact opposite state
> >     to the previous one.
> >
> >     I've had a long look through the logs but I don't seem to be able to
> >     pinpoint anything in particular that tells me why that host fails to
> >     come online.
> >
> >     I'd like to attach the logs, but thought that approx. 1500 lines of
> >     additional text in this e-mail might be a bit too much.
> >
> >     How should I best attach the logs and config files? Which parts of
> >     the logs and config files would most likely reveal the problem in
> >     this case?
> >
> >     Regards,
> >     James
> >
> >     _______________________________________________
> >     Linux-HA mailing list
> >     [email protected] <mailto:[email protected]>
> >     http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >     See also: http://linux-ha.org/ReportingProblems
> >
> >
>
