Re: [Linux-HA] Antw: Node remains offline after host restart

James Guthrie Fri, 26 Oct 2012 01:25:42 -0700

Hi Ulrich,

The cluster isn't being started automatically at boot time; I am 
manually starting the cluster with /etc/init.d/corosync start and 
/etc/init.d/pacemaker start.

I have had a look through the logfiles of both the online and offline 
hosts during the start of pacemaker on the offline host. Links to 
pastebins of the logs are below.

http://pastebin.com/HUxVYT84 is the "offline" host log.
http://pastebin.com/5tA2bNeq is the "online" host log.

The logfile of the offline host shows logs from when corosync and 
pacemaker were started on that host. The logfile of the online host 
shows logs from when corosync and pacemaker were stopped and then restarted.
The logfiles are quite long as I was also able to observe the 
"flip-flop" that I mentioned before: the state becomes the exact inverse 
to what it was previously.

I am unable to see or understand why the host that is starting up 
doesn't come online. What I have pinpointed as seemingly relevant are 
lines 751-766 of the "online" host log. Up until that point it appears as 
though the node r3 (the "offline" node) has come online and everything 
is running smoothly. Lines 751-766 seem to be the first indication 
on the "online" host that something's not right. I cannot find anything 
on the "offline" host to indicate what the problem could be.

The logs have the system time for each entry, and the system times on 
both hosts are within a few tens of milliseconds of each other. This 
should help to pinpoint where in the logs of the "offline" host something 
is going wrong. I was unable to find anything glaringly obvious, but I'm 
no expert.
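
For lining up the two logs by timestamp, a minimal sketch in Python 
follows. It assumes each entry starts with a syslog-style timestamp such 
as "Oct 26 10:01:02" (the actual layout of the pastebin logs may differ), 
that month names are in English, and that each file is already in time 
order. The function names and the year default are hypothetical, chosen 
only for illustration.

```python
import heapq
from datetime import datetime


def parse_line(line, host, year=2012):
    """Return (timestamp, host, text) for one log line, or None if unparsable."""
    try:
        # Assumed layout: the first 15 characters are "Mon DD HH:MM:SS".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        # Lines without a leading timestamp (e.g. continuations) are skipped.
        return None
    return (stamp, host, line.rstrip("\n"))


def tagged(lines, host, year=2012):
    """Yield parsed entries for one host, dropping unparsable lines."""
    for line in lines:
        entry = parse_line(line, host, year)
        if entry is not None:
            yield entry


def merge_logs(online_lines, offline_lines, year=2012):
    """Interleave two time-ordered logs into one list sorted by timestamp."""
    return list(
        heapq.merge(
            tagged(online_lines, "online", year),
            tagged(offline_lines, "offline", year),
        )
    )
```

Called with the lines of both files, merge_logs returns entries tagged 
"online" or "offline" in one timeline, so the "offline" host entries 
around lines 751-766 of the "online" log can be read side by side. 
Timestamps with one-second resolution cannot order events that fall 
within the same second.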

Regards,
James

On 10/26/2012 07:56 AM, Ulrich Windl wrote:
> Hi James,
>
> Does the cluster stack start automatically on boot of the offline host? If 
> so, the node probably won't come online immediately. The syslog (unless 
> redirected) of the offline node will provide initial clues as to what is 
> going on. Also see the syslog of the online node. Watching the syslog 
> (tail -f) of the online node while the other node boots may be a good idea.
>
> Regards,
> Ulrich
>
>>>> James Guthrie <[email protected]> wrote on 25.10.2012 at 18:13 in message
> <[email protected]>:
>> Hi all,
>>
>> I've been battling with this problem for a few hours now, and I've gone
>> over the obvious possible errors with the guys in the linux-ha IRC
>> channel. I'd really like some help in trying to solve this problem.
>>
>> I have a two node corosync/pacemaker cluster (corosync: 2.0.1 pacemaker:
>> 1.1.8). I can get the cluster to work fine, but I can also very easily
>> get the cluster into a state from which it seems unable to recover. All
>> I have to do is reboot one of the cluster node's hosts. When doing so,
>> any resources that were running on it are transferred to the second
>> host. When the host comes back up though it appears as OFFLINE in the
>> crm_mon of both cluster nodes.
>>
>> Regardless of what I do on the "offline" host, nothing gets better. If I
>> however stop and restart corosync/pacemaker on the other "online" host,
>> then everything seems to work again.
>>
>> I tried waiting a while with one node offline. After a while the online
>> node went offline, stating that the other node was now offline. For a
>> few minutes the output of crm_mon was different on both hosts (both
>> thought the other was online and they themselves were offline). Then
>> finally it settled into the exact opposite of the previous state.
>>
>> I've had a long look through the logs but I don't seem to be able to
>> pinpoint anything particular that tells me that there is a reason for
>> that host failing to be online.
>>
>> I'd like to attach the logs, but thought that approx 1500 lines of
>> additional text in this e-mail might be a bit too much.
>>
>> How should I best attach the logs and config files? Which parts of the
>> logs and config files would most likely reveal the problem in this case?
>>
>> Regards,
>> James
>>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
