Re: [Linux-HA] Node remains offline after host restart

James Guthrie Fri, 26 Oct 2012 02:14:10 -0700

Hi Emmanuel,

I should maybe have mentioned earlier that I'm not using either of the
subshells for pacemaker; I'm configuring everything via XML. Also, I
don't and won't have python compiled in my environment, so any crm
commands are a no-go.


Regards,
James


On 10/26/2012 11:10 AM, Emmanuel Saint-Joanis wrote:
> Just to check the syntax (not easy to read in XML) and see whether it shows
> something obviously bad, can you paste the output of: crm configure show
>
> 2012/10/26 James Guthrie <[email protected]>
>
>     Hi Emmanuel,
>
>     It might help for further debugging to attach my pacemaker config, so
>     here's a pastebin of `cibadmin -Ql` as it is on the cluster right now -
>     still in the state of one node being "offline" and the other online.
>
>     http://pastebin.com/s3kr6Fxx
>
>     As you can see in the config, I have stonith disabled.
>
>     Regards,
>     James
>
>     On 10/26/2012 10:48 AM, Emmanuel Saint-Joanis wrote:
>      > It seems like (CRMd/pEngine) thinks : "I didn't manage to shoot the
>      > failing node, therefore I (kind of) blacklist it as soon as I get
>      > control on it"
>      > Did you test extensively that your config works with ->
>      > stonith-enabled="false" <- first ?
>      >
>      >
>      > 2012/10/26 James Guthrie <[email protected]>
>      >
>      >     Hi Emmanuel,
>      >
>      >     corosync is bound to the correct interface on both hosts.
>      >
>      >     I looked for that line in the logs, but it didn't appear.
>      >
>      >     My previous e-mail addressed to Ulrich contains logfiles and a broad
>      >     explanation of the process that those logfiles capture.
>      >
>      >     Regards,
>      >     James
>      >
>      >     On 10/25/2012 06:34 PM, Emmanuel Saint-Joanis wrote:
>      >      > Looks like a common timeout issue while the network is coming up.
>      >      >
>      >      > See if corosync is bound to 127.0.0.1 instead of the real interface
>      >      > with:
>      >      > corosync-cmapctl | grep member
>      >      >
>      >      > Also check whether a line like this appears in /var/log/messages:
>      >      > WARN: cib_peer_callback: Discarding cib_apply_diff message (322) from server2: not in our membership
>      >      >
>      >      > Send the logs to a web service such as pastebin.com.
>      >      >
>      >      > 2012/10/25 James Guthrie <[email protected]>
>      >      >
>      >      >     Hi all,
>      >      >
>      >      >     I've been battling with this problem for a few hours now and have
>      >      >     gone over the obvious possible causes with the guys on the
>      >      >     linux-ha IRC channel. I'd really like some help in solving it.
>      >      >
>      >      >     I have a two-node corosync/pacemaker cluster (corosync: 2.0.1,
>      >      >     pacemaker: 1.1.8). I can get the cluster to work fine, but I can
>      >      >     also very easily get it into a state from which it seems unable
>      >      >     to recover. All I have to do is reboot one of the cluster nodes'
>      >      >     hosts. When I do so, any resources that were running on it are
>      >      >     transferred to the second host. When the host comes back up,
>      >      >     though, it appears as OFFLINE in the crm_mon output of both
>      >      >     cluster nodes.
>      >      >
>      >      >     Regardless of what I do on the "offline" host, nothing gets
>      >      >     better. If, however, I stop and restart corosync/pacemaker on the
>      >      >     other, "online" host, everything seems to work again.
>      >      >
>      >      >     I tried waiting a while with one node offline. After a while the
>      >      >     online node went offline, stating that the other node was now
>      >      >     offline. For a few minutes the output of crm_mon differed between
>      >      >     the two hosts (each thought the other was online and that it was
>      >      >     itself offline). Then it finally settled in the exact opposite
>      >      >     state to the previous one.
>      >      >
>      >      >     I've had a long look through the logs, but I can't pinpoint
>      >      >     anything in particular that explains why that host fails to come
>      >      >     online.
>      >      >
>      >      >     I'd like to attach the logs, but thought that approx. 1500 lines
>      >      >     of additional text in this e-mail might be a bit too much.
>      >      >
>      >      >     How should I best attach the logs and config files? Which parts
>      >      >     of the logs and config files would most likely reveal the problem
>      >      >     in this case?
>      >      >
>      >      >     Regards,
>      >      >     James
>      >      >
>      >      >     _______________________________________________
>      >      >     Linux-HA mailing list
>      >      >     [email protected]
>      >      >     http://lists.linux-ha.org/mailman/listinfo/linux-ha
>      >      >     See also: http://linux-ha.org/ReportingProblems
>      >      >
>      >      >
>      >
>      >
>
>

