Re: [Linux-HA] Node remains offline after host restart

Andrew Beekhof Wed, 31 Oct 2012 17:36:26 -0700

On Wed, Oct 31, 2012 at 11:11 PM, James Guthrie <[email protected]> wrote:
> Hi all,
>
> it appears as though this is the problem. The /etc/hosts file specifies
> per-interface hostnames e.g.
>
> 192.168.200.170         r4-eth1
>
> This explains the difference in the hostname that appears to be causing
> a problem.


Do all the nodes have that mapping though?
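
One way to check this is to compare what each node's /etc/hosts says about the address in question. The following is only a minimal sketch: the helper names are made up for illustration, it looks at /etc/hosts-style text only (not DNS or nsswitch ordering), and the r3 hosts content is a hypothetical placeholder rather than anything taken from the real machine.

```python
def names_for_address(hosts_text, address):
    """Return the names that an /etc/hosts-style text maps to `address`."""
    names = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if fields[0] == address:
            names.extend(fields[1:])
    return names


def mapping_differs(hosts_by_node, address):
    """Resolve `address` against every node's hosts text.

    Returns (resolved, differs): the names seen per node, and whether
    the nodes disagree about them.
    """
    resolved = {
        node: names_for_address(text, address)
        for node, text in hosts_by_node.items()
    }
    distinct = {tuple(names) for names in resolved.values()}
    return resolved, len(distinct) > 1


# The r4 entry is the one quoted above; the r3 content is a hypothetical stand-in.
hosts_by_node = {
    "r3": "127.0.0.1 localhost\n",
    "r4": "192.168.200.170         r4-eth1\n",
}
resolved, differs = mapping_differs(hosts_by_node, "192.168.200.170")
print(resolved, differs)
```

If the nodes disagree, each one may infer a different name for the same nodeid, which would fit the symptoms described.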

>
> I have used a nodelist to specify the nodes of the cluster, their ids
> and their names. This seems to have resolved the problem. I haven't been
> able to do enough definitive testing.
>
> The "nodelist" feature is entirely undocumented, a look at the source
> code confirmed that there was in fact a "name" field that would be
> looked for in the config. When will the documentation be updated?

The focus is slowly shifting to documentation now.  This is one area
in particular that needs documenting.
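
Until that documentation exists, here is a rough sketch of what such a nodelist stanza in corosync.conf might look like, written as a small Python helper that renders it. The helper name is invented for illustration. The r4 address comes from the /etc/hosts entry above and nodeid 2 from the crmd log line quoted below; the r3 address and nodeid, and the short names "r3" and "r4", are assumptions and need to be replaced with the real values.

```python
def render_nodelist(nodes):
    """Render a corosync.conf nodelist stanza.

    `nodes` is a list of (ring0_addr, nodeid, name) tuples.
    """
    lines = ["nodelist {"]
    for ring0_addr, nodeid, name in nodes:
        lines += [
            "    node {",
            f"        ring0_addr: {ring0_addr}",
            f"        nodeid: {nodeid}",
            f"        name: {name}",
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines)


print(render_nodelist([
    ("192.168.200.169", 1, "r3"),  # ASSUMED address and nodeid for r3
    ("192.168.200.170", 2, "r4"),  # address from /etc/hosts, nodeid from the log
]))
```

The point of the "name" field is that the node name no longer has to be inferred from a reverse lookup of the ring address, so a per-interface name such as r4-eth1 should not end up being used as the node name.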

>
> I understand that the logs were displaying the warning signs of
> something being wrong with the configuration, but it wasn't really
> enough to pinpoint the source of the problem. Maybe this could be looked into?

Absolutely.  Ideally we'd be able to make it "just work" without the
nodelist even.
But I need to get my head around your configuration first :)

>
> Regards,
> James
>
>
> On 10/30/2012 01:03 PM, Michael Schwartzkopff wrote:
>>> Hi Michael,
>>>
>>> I have managed to successfully configure corosync with udpu, it
>>> unfortunately hasn't made a difference in the behaviour of the cluster.
>>>
>>> I have found that I don't even need to restart the host in order to get
>>> this behaviour - all I need to do is stop and restart corosync and
>>> pacemaker on *one* of the hosts. To be precise: I've been able to narrow
>>> it down to only one of the two hosts (r3). If I reboot the host, or
>>> restart the services on r4 everything works fine. If I try the same with
>>> r3, I have problems.
>>>
>>> I feel as though the answer may lie in the logfiles, but the
>>> intercommunication between the individual components of the HA software
>>> makes it difficult for an outsider to read them accurately. I have
>>> attached the logs of both r3 and r4 after reproducing this effect this
>>> afternoon; they are much shorter than the previous ones:
>>>
>>> corosync-r3.log: http://pastebin.com/ZAhh5nax
>>> corosync-r4.log: http://pastebin.com/SETtqnZM
>>>
>>> Are there any other steps I could take in debugging this behaviour?
>>>
>>> Regards,
>>> James
>>
>> hi,
>>
>> I think you have a problem with the naming of your cluster nodes. In the
>> first log, r3 learns the name from DNS:
>>
>> Oct 29 13:41:14 [21723] r3       crmd:   notice: corosync_node_name:
>>   Inferred node name 'r4-eth1' for nodeid 2 from DNS
>>
>> If that does not match the actual name of the node, it might cause the problems.
>>
>> Greetings,
>>
>>
>>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
