Hi all,

I've been battling with this problem for a few hours now, and I've 
already gone over the obvious possible causes with the guys in the 
linux-ha IRC channel. I'd really like some help in trying to solve it.

I have a two node corosync/pacemaker cluster (corosync: 2.0.1 pacemaker: 
1.1.8). I can get the cluster to work fine, but I can also very easily 
get the cluster into a state from which it seems unable to recover. All 
I have to do is reboot one of the cluster nodes' hosts. When I do so, 
any resources that were running on it are transferred to the second 
host. When the host comes back up, though, it appears as OFFLINE in the 
crm_mon output on both cluster nodes.

Regardless of what I do on the "offline" host, nothing gets better. 
However, if I stop and restart corosync/pacemaker on the other "online" 
host, then everything seems to work again.

I tried waiting a while with one node offline. After some time the 
online node went offline, stating that the other node was now offline. 
For a few minutes the output of crm_mon differed between the two hosts 
(each thought the other was online and that it was itself offline). 
Then it finally settled in the exact opposite state to the previous one.
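
Since the two hosts disagree about membership, it may help to capture 
each node's own view at the same moment and compare them side by side. 
The following is only a minimal sketch, assuming crm_mon, 
corosync-cfgtool and corosync-quorumtool are on the PATH of each node. 
The function name snapshot and the choice of commands are illustrative 
and are not taken from the cluster setup described here.

```python
import shutil
import subprocess

# Commands assumed to ship with corosync 2.x / pacemaker 1.1.x.
DEFAULT_COMMANDS = [
    ["crm_mon", "-1"],              # one-shot cluster status
    ["corosync-cfgtool", "-s"],     # ring status as seen by this node
    ["corosync-quorumtool", "-s"],  # quorum and membership as seen by this node
]


def snapshot(commands=DEFAULT_COMMANDS, timeout=30):
    """Run each command and return {command line: output or a short note}."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        # Skip tools that are not installed instead of raising.
        if shutil.which(cmd[0]) is None:
            results[key] = "(not found on PATH)"
            continue
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text with the output
                universal_newlines=True,
                timeout=timeout,
            )
            results[key] = proc.stdout
        except subprocess.TimeoutExpired:
            results[key] = "(timed out)"
    return results


if __name__ == "__main__":
    for name, output in snapshot().items():
        print("=== " + name + " ===")
        print(output)
```

Run on both hosts while they disagree, the two outputs could be saved 
to files and diffed to see where the membership views diverge.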

I've had a long look through the logs, but I can't pinpoint anything 
in particular that explains why that host fails to come back online.

I'd like to attach the logs, but thought that approx 1500 lines of 
additional text in this e-mail might be a bit too much.

How should I best attach the logs and config files? Which parts of the 
logs and config files would most likely reveal the problem in this case?

Regards,
James

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
