Hi Emmanuel,

I should have mentioned earlier that I'm not using either of the subshells for pacemaker; I'm configuring everything via XML. Also, I don't and won't have Python compiled in my environment, so any crm commands are a no-go.

Regards,
James

On 10/26/2012 11:10 AM, Emmanuel Saint-Joanis wrote:
> just to see the syntax (not easy in XML), if it shows something
> obviously bad
> can U paste the : crm configure show
>
> 2012/10/26 James Guthrie <[email protected]>
>
> Hi Emmanuel,
>
> It might help for further debugging to attach my pacemaker config, so
> here's a pastebin of `cibadmin -Ql` as it is on the cluster right now -
> still in the state of one node being "offline" and the other online.
>
> http://pastebin.com/s3kr6Fxx
>
> As you can see in the config, I have stonith disabled.
>
> Regards,
> James
>
> On 10/26/2012 10:48 AM, Emmanuel Saint-Joanis wrote:
> > It seems like (CRMd/pEngine) thinks : "I didn't manage to shoot the
> > failing node, therefore I (kind of) blacklist it as soon as I get
> > control on it"
> > Did you test extensively that your config works with ->
> > stonith-enabled="false" <- first ?
> >
> > 2012/10/26 James Guthrie <[email protected]>
> >
> > Hi Emmanuel,
> >
> > corosync is bound to the correct interface on both hosts.
> >
> > I looked for that line in the logs, but it didn't appear.
> >
> > My previous e-mail addressed to Ulrich contains logfiles and a broad
> > explanation of the process that those logfiles capture.
> >
> > Regards,
> > James
> >
> > On 10/25/2012 06:34 PM, Emmanuel Saint-Joanis wrote:
> > > Looks like a common timeout issue in network upcoming.
> > >
> > > See if corosync is bound to 127.0.0.1 instead of real interface with :
> > > corosync-cmapctl | grep member
> > >
> > > Also check if no line is appearing in /var/log/messages :
> > > WARN: cib_peer_callback: Discarding cib_apply_diff message (322) from server2: not in our membership
> > >
> > > Send logs to any web service as pastebin.com.
> > >
> > > 2012/10/25 James Guthrie <[email protected]>
> > >
> > > Hi all,
> > >
> > > I've been battling with this problem for a few hours now, I've gone over
> > > the obvious errors that it could have been with the guys in the linux-ha
> > > IRC. I'd really like some help in trying to solve this problem.
> > >
> > > I have a two node corosync/pacemaker cluster (corosync: 2.0.1 pacemaker:
> > > 1.1.8). I can get the cluster to work fine, but I can also very easily
> > > get the cluster into a state from which it seems unable to recover. All
> > > I have to do is reboot one of the cluster node's hosts. When doing so,
> > > any resources that were running on it are transferred to the second
> > > host. When the host comes back up though it appears as OFFLINE in the
> > > crm_mon of both cluster nodes.
> > >
> > > Regardless of what I do on the "offline" host, nothing gets better. If I
> > > however stop and restart corosync/pacemaker on the other "online" host,
> > > then everything seems to work again.
> > >
> > > I tried waiting a while with one node offline, after a while the online
> > > node went offline, stating that the other node was now offline. For a
> > > few minutes the output of crm_mon was different on both hosts (both
> > > thought the other was online, they were offline). Then finally it
> > > settled in the exact opposite state as previously.
> > >
> > > I've had a long look through the logs but I don't seem to be able to
> > > pinpoint anything particular that tells me that there is a reason for
> > > that host failing to be online.
> > >
> > > I'd like to attach the logs, but thought that approx 1500 lines of
> > > additional text in this e-mail might be a bit too much.
> > >
> > > How should I best attach the logs and config files? Which parts of the
> > > logs and config files would most likely reveal the problem in this case?
> > >
> > > Regards,
> > > James
> > >
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
