Hi all,

As mentioned in my previous e-mail, I get different results depending on which node is the DC. I have now compiled a logfile from a run with r3 as DC, which is the case that always works, and compared it with the previous logfiles. In both runs the same action is triggered, but what happens afterwards differs.
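(For anyone following along: which node is currently acting as DC can be confirmed with the standard Pacemaker tools; this assumes a 1.1.x installation, and the exact output may vary between versions.)

  crmadmin -D     # print the uname of the current DC
  crm_mon -1      # one-shot cluster status; the DC is shown on the "Current DC:" line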
corosync-r3-DC.log: http://pastebin.com/axSRfzEJ
corosync-r4-DC.log: http://pastebin.com/SETtqnZM

On line 567 of r3-DC.log and line 572 of r4-DC.log the same thing happens:

crmd: info: abort_transition_graph: do_te_invoke:156 - Triggered transition abort (complete=1) : Peer Cancelled

With r4 as DC the following then takes place (lines 600-620 of r4-DC.log; date and other unnecessary information removed):

te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.385.1) : Non-status change
Cause <diff crm_feature_set="3.0.6" >
Cause <diff-removed admin_epoch="0" epoch="384" num_updates="7" >
Cause <cib admin_epoch="0" epoch="384" num_updates="7" >
Cause <configuration >
Cause <nodes >
Cause <node uname="r3" id="1" />
Cause </nodes>
Cause </configuration>
Cause </cib>
Cause </diff-removed>
Cause <diff-added >
Cause <cib epoch="385" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="r4" update-client="crmd" cib-last-written="Mon Oct 29 13:41:16 2012" have-quorum="1" dc-uuid="2" >
Cause <configuration >
Cause <nodes >
Cause <node id="1" uname="r3-eth1" />
Cause </nodes>
Cause </configuration>
Cause </cib>
Cause </diff-added>
Cause </diff>

This replaces the node entry uname="r3" with one named "r3-eth1", effectively removing r3 as originally defined from the CIB. With r3 as DC the above does not happen: the node remains online and is assigned resources shortly afterwards.
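In case it is useful, this is roughly how the node entries can be checked on either host. These are just the standard Pacemaker 1.1 command-line tools (exact options may vary slightly between versions), used as a quick way to compare the name the local stack uses with what is stored in the CIB:

  uname -n               # hostname, normally what the stack uses as the node name
  cibadmin -Q -o nodes   # dump the <nodes> section of the CIB
  crm_node -n            # node name as the cluster sees it (if your crm_node supports -n)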
Could anyone suggest a reason for the different behaviour in these two cases?

Regards,
James

On 10/29/2012 01:51 PM, James Guthrie wrote:
> Hi Michael,
>
> I have managed to successfully configure corosync with udpu; unfortunately it
> hasn't made a difference in the behaviour of the cluster.
>
> I have found that I don't even need to restart the host in order to get
> this behaviour - all I need to do is stop and restart corosync and
> pacemaker on *one* of the hosts. To be precise: I've been able to narrow
> it down to only one of the two hosts (r3). If I reboot the host, or
> restart the services on r4, everything works fine. If I try the same with
> r3, I have problems.
>
> I feel as though the answer may lie in the logfiles, but the
> intercommunication between the individual components of the HA software
> makes it a bit difficult for an outsider to this software to read them
> accurately. I have attached the logs of both r3 and r4 after reproducing
> this effect this afternoon; they are much shorter than the previous ones:
>
> corosync-r3.log: http://pastebin.com/ZAhh5nax
> corosync-r4.log: http://pastebin.com/SETtqnZM
>
> Are there any other steps I could take in debugging this behaviour?
>
> Regards,
> James
>
> On 10/26/2012 04:33 PM, Michael Schwartzkopff wrote:
>>> Hi Michael,
>>>
>>> I'm working with a Linux From Scratch based kernel (version 3.4.7)
>>> running in a virtual machine and with virtual switches.
>> (...)
>>> `tcpdump -ni eth1 port 5404` returns:
>>>
>>> listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
>>> 16:22:27.849551 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>> 16:22:28.210578 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>> 16:22:28.770181 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>> 16:22:28.989802 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>> 16:22:29.370684 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>> 16:22:29.751062 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>
>>> Every now and then there is a packet from r4 (192.168.200.170); it does
>>> appear as though r4 is quite quiet, though.
>>
>> Ah. No packets from 192.168.200.166 unicast? Please try to configure unicast
>> in your corosync configuration. See the udpu README file of corosync.
>>
>> I had the same problem, and the cause was that the virtual bridge or KVM
>> dropped all multicast packets.
>>
>> Greetings,

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
