Re: [Linux-HA] Node remains offline after host restart

Andrew Beekhof Wed, 31 Oct 2012 17:34:08 -0700

On Tue, Oct 30, 2012 at 7:11 PM, James Guthrie <[email protected]> wrote:
> Hi Andrew,
>
> In which category should I file the bug? Based on my issues I'm assuming
> "Pacemaker" > "Other" or maybe "Linux-HA" > "CRM Misc."?


http://bugs.clusterlabs.org/enter_bug.cgi?product=Pacemaker and then "Core"

>
> I seem to be unable to use crm_report as my install is a "Non-standard
> Pacemaker installation",

Really? How did you install?  Usually it can find things anyway.

> the documentation doesn't suggest that there's
> the possibility to give a path at which the required files can be found.
> Does it make sense to manually put the files together?

Sure.  In your case, I mostly need the logs and the corosync config.
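If crm_report cannot be used, the two items requested above can be bundled by hand. The following is only a minimal sketch: the helper name `bundle_report` is hypothetical, and the paths you pass in depend entirely on where your install puts its logs and corosync config.

```python
import tarfile
from pathlib import Path


def bundle_report(paths, out_path):
    """Pack every path that exists as a regular file into a gzip tarball.

    Returns the list of archive member names that were added, so the
    caller can see which of the requested files were actually found.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for p in map(Path, paths):
            if p.is_file():
                # Store under the bare file name (hypothetical layout choice).
                tar.add(p, arcname=p.name)
                added.append(p.name)
    return added
```

For example, `bundle_report(["/etc/corosync/corosync.conf", "/var/log/corosync.log"], "r3-report.tar.gz")` would pack both files if they exist at those locations. Both paths are assumptions, not something the thread confirms. Files from different nodes that share a name should go into separate tarballs, since members are stored by bare file name.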

>
> Regards,
> James
>
> On 10/30/2012 05:55 AM, Andrew Beekhof wrote:
>> Can you file a bug for this and include a crm_report tarball?
>> It sounds like there is a mismatch in the way node name is being
>> detected/calculated - which could either be a bug or a
>> misconfiguration.
>>
>> On Tue, Oct 30, 2012 at 12:46 AM, James Guthrie <[email protected]> wrote:
>>> Hi all,
>>>
>>> As mentioned in my previous e-mail, I get different results with
>>> different nodes as DC. I have now compiled a logfile when using r3 as
>>> DC, which is the case that always works. I looked into the difference
>>> between this situation and the previous logfiles. In both instances the
>>> same action is triggered, but what follows differs between the two.
>>>
>>> corosync-r3-DC.log: http://pastebin.com/axSRfzEJ
>>> corosync-r4-DC.log: http://pastebin.com/SETtqnZM
>>>
>>> On line 567 of r3-DC.log and 572 of r4-DC.log the same thing happens:
>>>
>>> crmd:     info: abort_transition_graph:        do_te_invoke:156 -
>>> Triggered transition abort (complete=1) : Peer Cancelled
>>>
>>> With r4 as DC the following takes place (lines 600-620 of r4-DC.log -
>>> date and other unnecessary information removed):
>>>
>>> te_update_diff:126 - Triggered transition abort (complete=1, tag=diff,
>>> id=(null), magic=NA, cib=0.385.1) : Non-status change
>>> Cause <diff crm_feature_set="3.0.6" >
>>> Cause   <diff-removed admin_epoch="0" epoch="384" num_updates="7" >
>>> Cause     <cib admin_epoch="0" epoch="384" num_updates="7" >
>>> Cause       <configuration >
>>> Cause         <nodes >
>>> Cause           <node uname="r3" id="1" />
>>> Cause         </nodes>
>>> Cause       </configuration>
>>> Cause     </cib>
>>> Cause   </diff-removed>
>>> Cause   <diff-added >
>>> Cause     <cib epoch="385" num_updates="1" admin_epoch="0"
>>> validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="r4"
>>> update-client="crmd" cib-last-written="Mon Oct 29 13:41:16 2012"
>>> have-quorum="1" dc-uuid="2" >
>>> Cause       <configuration >
>>> Cause         <nodes >
>>> Cause           <node id="1" uname="r3-eth1" />
>>> Cause         </nodes>
>>> Cause       </configuration>
>>> Cause     </cib>
>>> Cause   </diff-added>
>>> Cause </diff>
>>>
>>> which appears to remove the node from the CIB.
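One way to read the diff quoted above is to compare the `<node>` entries in its removed and added halves by `id`. The sketch below does that with a trimmed copy of the diff (most attributes dropped for brevity); the function names are hypothetical and it uses only the Python standard library. On this input it reports that the entry with id 1 changes its `uname` from r3 to r3-eth1.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the diff quoted in the message above.
DIFF = """\
<diff crm_feature_set="3.0.6">
  <diff-removed admin_epoch="0" epoch="384" num_updates="7">
    <cib admin_epoch="0" epoch="384" num_updates="7">
      <configuration><nodes>
        <node uname="r3" id="1"/>
      </nodes></configuration>
    </cib>
  </diff-removed>
  <diff-added>
    <cib epoch="385" num_updates="1" admin_epoch="0">
      <configuration><nodes>
        <node id="1" uname="r3-eth1"/>
      </nodes></configuration>
    </cib>
  </diff-added>
</diff>
"""


def node_names(root, section):
    """Map node id -> uname for every <node> below the given diff section."""
    return {n.get("id"): n.get("uname") for n in root.find(section).iter("node")}


def renamed_nodes(diff_xml):
    """Return {id: (old_uname, new_uname)} for ids present on both sides."""
    root = ET.fromstring(diff_xml)
    removed = node_names(root, "diff-removed")
    added = node_names(root, "diff-added")
    return {
        i: (removed[i], added[i])
        for i in removed.keys() & added.keys()
        if removed[i] != added[i]
    }


print(renamed_nodes(DIFF))  # {'1': ('r3', 'r3-eth1')}
```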
>>>
>>> In the case of r3 as DC, the above doesn't happen: the node remains
>>> online and is assigned resources shortly afterwards.
>>>
>>> Could anyone suggest a reason for the different behaviour in these cases?
>>>
>>> Regards,
>>> James
>>>
>>>
>>> On 10/29/2012 01:51 PM, James Guthrie wrote:
>>>> Hi Michael,
>>>>
>>>> I have managed to successfully configure corosync with udpu, it
>>>> unfortunately hasn't made a difference in the behaviour of the cluster.
>>>>
>>>> I have found that I don't even need to restart the host in order to get
>>>> this behaviour - all I need to do is stop and restart corosync and
>>>> pacemaker on *one* of the hosts. To be precise: I've been able to narrow
>>>> it down to only one of the two hosts (r3). If I reboot the host, or
>>>> restart the services on r4 everything works fine. If I try the same with
>>>> r3, I have problems.
>>>>
>>>> I feel as though the answer may lie in the logfiles, but the
>>>> intercommunication between the individual components of the HA software
>>>> makes it a bit difficult to accurately read the logfiles as an outsider
>>>> to this software. I have attached the logs of both r3 and r4 after
>>>> reproducing this effect this afternoon; they are much shorter to read
>>>> than the previous ones:
>>>>
>>>> corosync-r3.log: http://pastebin.com/ZAhh5nax
>>>> corosync-r4.log: http://pastebin.com/SETtqnZM
>>>>
>>>> Are there any other steps I could take in debugging this behaviour?
>>>>
>>>> Regards,
>>>> James
>>>>
>>>> On 10/26/2012 04:33 PM, Michael Schwartzkopff wrote:
>>>>>> Hi Michael,
>>>>>>
>>>>>> I'm working with a Linux From Scratch based kernel (version 3.4.7)
>>>>>> running in a virtual machine and with virtual switches.
>>>>> (...)
>>>>>> `tcpdump -ni eth1 port 5404` returns:
>>>>>>
>>>>>> listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
>>>>>> 16:22:27.849551 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>>>> 16:22:28.210578 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>>>> 16:22:28.770181 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>>>> 16:22:28.989802 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>>>> 16:22:29.370684 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>>>> 16:22:29.751062 IP 192.168.200.166.5404 > 224.0.0.18.5405: UDP, length 87
>>>>>>
>>>>>> Every now and then there is a packet from r4 (192.168.200.170), but r4
>>>>>> does appear to be quite quiet.
>>>>>
>>>>> Ah. No packets from 192.168.200.166 unicast? Please try to configure
>>>>> unicast in your corosync configuration. See the udpu README file of
>>>>> corosync.
>>>>>
>>>>> I had the same problem and the cause was that the virtual bridge or KVM
>>>>> dropped all multicast packets.
>>>>>
>>>>> Greetings,
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Linux-HA mailing list
>>>>> [email protected]
>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>> See also: http://linux-ha.org/ReportingProblems
>>>>>
>>>>
>>>>
>>>
>>
>
