Hi!

Just one idea: does "crm_verify -L" come back clean?
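
Spelled out, that check would look like this on a cluster node (a sketch added for reference; shown as comments since it needs a live cluster):

```shell
# Validate the cluster configuration (run on a cluster node):
#   crm_verify -L -V
# -L checks the live CIB; -V adds verbose detail for each problem found.
# Exit status 0 means the configuration passed validation.
# crm_verify can also check a CIB that was dumped to a file, e.g.:
#   crm_verify -x /tmp/cib.xml -V
```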

Regards,
Ulrich

>>> James Guthrie <[email protected]> wrote on 26.10.2012 at 11:14 in message
<[email protected]>:
> Hi Emmanuel,
> 
> I should maybe have mentioned earlier that I'm not using either of the 
> subshells for pacemaker, I'm configuring everything via XML. Also, I 
> don't and won't have python compiled in my environment, so any crm 
> commands are a no-go.
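
For context, the XML-only workflow James describes can be sketched roughly like this (an editor's example: the Dummy resource and file path are made up, not from James's config, and the live-cluster commands are shown as comments):

```shell
# Sketch of managing pacemaker purely via XML, with no crm shell involved.
# The Dummy resource below is a made-up example, not from the actual config.
cat > /tmp/example-rsc.xml <<'EOF'
<primitive id="example-dummy" class="ocf" provider="pacemaker" type="Dummy">
  <operations>
    <op id="example-dummy-monitor-10s" name="monitor" interval="10s"/>
  </operations>
</primitive>
EOF

# On a live cluster, cibadmin loads and inspects the CIB directly:
#   cibadmin -o resources -C -x /tmp/example-rsc.xml   # add the resource
#   cibadmin -Ql                                       # dump the live CIB
grep -c '<op ' /tmp/example-rsc.xml   # sanity check: prints 1
```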
> 
> Regards,
> James
> 
> 
> On 10/26/2012 11:10 AM, Emmanuel Saint-Joanis wrote:
> > Just to check the syntax (not easy to read in XML) and see whether it
> > shows anything obviously bad, can you paste the output of:
> > crm configure show
> >
> > 2012/10/26 James Guthrie <[email protected]>
> >
> >     Hi Emmanuel,
> >
> >     It might help for further debugging to attach my pacemaker config, so
> >     here's a pastebin of `cibadmin -Ql` as it is on the cluster right now -
> >     still in the state of one node being "offline" and the other online.
> >
> >     http://pastebin.com/s3kr6Fxx 
> >
> >     As you can see in the config, I have stonith disabled.
> >
> >     Regards,
> >     James
> >
> >     On 10/26/2012 10:48 AM, Emmanuel Saint-Joanis wrote:
> >      > It seems like the CRMd/PEngine thinks: "I didn't manage to shoot the
> >      > failing node, therefore I (kind of) blacklist it as soon as I regain
> >      > control of it."
> >      > Did you first test extensively that your config works with
> >      > stonith-enabled="false"?
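
Since James configures everything via XML, the property Emmanuel refers to would be set without the crm shell roughly like this (an editor's sketch; the nvpair id shown is the conventional one, and the live-cluster commands are comments):

```shell
# How stonith-enabled="false" appears in the CIB's cluster_property_set.
# The nvpair id below is the conventional one; adjust to the actual CIB.
cat > /tmp/stonith-opt.xml <<'EOF'
<nvpair id="cib-bootstrap-options-stonith-enabled"
        name="stonith-enabled" value="false"/>
EOF

# On a live cluster, crm_attribute manages the same property without crm:
#   crm_attribute --type crm_config --name stonith-enabled --update false
#   crm_attribute --type crm_config --name stonith-enabled --query
grep -o 'value="[^"]*"' /tmp/stonith-opt.xml   # prints value="false"
```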
> >      >
> >      >
> >      > 2012/10/26 James Guthrie <[email protected]>
> >      >
> >      >     Hi Emmanuel,
> >      >
> >      >     corosync is bound to the correct interface on both hosts.
> >      >
> >      >     I looked for that line in the logs, but it didn't appear.
> >      >
> >      >     My previous e-mail addressed to Ulrich contains logfiles and a
> >      >     broad explanation of the process that those logfiles capture.
> >      >
> >      >     Regards,
> >      >     James
> >      >
> >      >     On 10/25/2012 06:34 PM, Emmanuel Saint-Joanis wrote:
> >      >      > This looks like a common network timeout issue.
> >      >      >
> >      >      > Check whether corosync is bound to 127.0.0.1 instead of the
> >      >      > real interface with:
> >      >      > corosync-cmapctl | grep member
> >      >      >
> >      >      > Also check whether a line like the following appears in
> >      >      > /var/log/messages:
> >      >      > WARN: cib_peer_callback: Discarding cib_apply_diff message
> >      >      > (322) from server2: not in our membership
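
The two checks above, spelled out as an editor's sketch (the live commands are comments because they need a cluster node; the sample log line is copied from the mail so the grep pattern can be exercised anywhere):

```shell
# On the affected node itself, the two checks are:
#   corosync-cmapctl | grep member    # members should list the real IPs,
#                                     # not 127.0.0.1
#   grep 'cib_peer_callback' /var/log/messages
# Exercise the same pattern against the sample line from the mail:
sample='WARN: cib_peer_callback: Discarding cib_apply_diff message (322) from server2: not in our membership'
echo "$sample" | grep -q 'not in our membership' && echo "pattern matches"
```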
> >      >      >
> >      >      > Send the logs to a web service such as pastebin.com.
> >      >      >
> >      >      > 2012/10/25 James Guthrie <[email protected]>
> >      >      >
> >      >      >     Hi all,
> >      >      >
> >      >      >     I've been battling with this problem for a few hours now;
> >      >      >     I've gone over the obvious candidate errors with the guys
> >      >      >     in the linux-ha IRC. I'd really like some help in trying
> >      >      >     to solve this problem.
> >      >      >
> >      >      >     I have a two-node corosync/pacemaker cluster (corosync
> >      >      >     2.0.1, pacemaker 1.1.8). I can get the cluster to work
> >      >      >     fine, but I can also very easily get it into a state from
> >      >      >     which it seems unable to recover. All I have to do is
> >      >      >     reboot one of the cluster nodes' hosts. When doing so,
> >      >      >     any resources that were running on it are transferred to
> >      >      >     the second host. When the host comes back up, though, it
> >      >      >     appears as OFFLINE in the crm_mon of both cluster nodes.
> >      >      >
> >      >      >     Regardless of what I do on the "offline" host, nothing
> >      >      >     gets better. If, however, I stop and restart
> >      >      >     corosync/pacemaker on the other "online" host, then
> >      >      >     everything seems to work again.
> >      >      >
> >      >      >     I tried waiting a while with one node offline; after a
> >      >      >     while the online node also went offline, stating that the
> >      >      >     other node was now offline. For a few minutes the output
> >      >      >     of crm_mon differed between the two hosts (each thought
> >      >      >     the other was online and itself offline). Then finally it
> >      >      >     settled in the exact opposite state from before.
> >      >      >
> >      >      >     I've had a long look through the logs, but I can't
> >      >      >     pinpoint anything in particular that explains why that
> >      >      >     host fails to come online.
> >      >      >
> >      >      >     I'd like to attach the logs, but thought that approx 1500
> >      >      >     lines of additional text in this e-mail might be a bit
> >      >      >     too much.
> >      >      >
> >      >      >     How should I best attach the logs and config files? Which
> >      >      >     parts of the logs and config files would most likely
> >      >      >     reveal the problem in this case?
> >      >      >
> >      >      >     Regards,
> >      >      >     James
> >      >      >
> >      >      >     _______________________________________________
> >      >      >     Linux-HA mailing list
> >      >      > [email protected] 
> >     <mailto:[email protected]>
> >     <mailto:[email protected] 
> >     <mailto:[email protected]>>
> >      >     <mailto:[email protected] 
> >     <mailto:[email protected]>
> >      >     <mailto:[email protected] 
> >     <mailto:[email protected]>>>
> >      >      > http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> >      >      >     See also: http://linux-ha.org/ReportingProblems 
> >      >      >
> >      >      >
> >      >
> >      >
> >      >
> >
> >
> >
> 
> 

 
 