Hi,
On Wed, Nov 28, 2007 at 09:19:18AM +0000, Amos Shapira wrote:
> On 28/11/2007, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >
> > On Nov 28, 2007, at 5:41 AM, Amos Shapira wrote:
> >
> > > Hello,
> > >
> > > I've been trying to follow the instructions in
> > > http://wiki.centos.org/HowTos/Ha-Drbd to setup a basic DRBD test
> > > cluster on top of a couple of Xen guests.
> > >
> > > In case this matters, both Xen guests and the Xen Dom0 run CentOS 5,
> > > which comes with Heartbeat 2.1.2-3 and DRBD 8.0.6.
> > >
> > > Another point is that my colleague is setting up a test cluster on
> > > another couple of guests on the same network, so I am careful not to
> > > use broadcast on the default port (I either use a different port or
> > > mcast).
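For reference, a multicast setup in ha.cf would look something like the
sketch below; the group address and ttl here are just examples, adjust
to taste:

----8<---------------------------------------------------------------------------
# mcast <dev> <mcast-group> <port> <ttl> <loop>
mcast eth0 225.0.0.1 695 1 0
----8<---------------------------------------------------------------------------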
> > >
> > > What I see is that the primary node comes up but never sees the
> > > secondary one.
> > >
> > > Here is /etc/ha.d/ha.cf so far (no resources configured yet in the CRM):
> > > ----8<---------------------------------------------------------------------------
> > > keepalive 2
> > > deadtime 30
> > > warntime 10
> > > initdead 120
> > > udpport 695
> > > bcast eth0
> > > node drbd01.test
> > > node drbd02.test
> > > crm yes
> > > ----8<---------------------------------------------------------------------------
> > >
> > > And when I start heartbeat and crm_mon it shows (after about a
> > > minute):
> > > ----8<---------------------------------------------------------------------------
> > > Refresh in 1s...
> > >
> > > ============
> > > Last updated: Thu Nov 29 02:25:55 2007
> > > Current DC: drbd01.test (580f615c-2921-4ba6-b03e-9a6e179d9ae2)
> > > 1 Nodes configured.
> > > 0 Resources configured.
> > > ============
> > >
> > > Node: drbd01.test (580f615c-2921-4ba6-b03e-9a6e179d9ae2): online
> > > ----8<---------------------------------------------------------------------------
> > > The instructions in the above-mentioned link say that both nodes
> > > should have been displayed (of course I copied the files over to the
> > > other node and started heartbeat on it as well).
> > >
> > > The other node never gets out of the "Not connected: Refresh in
> > > 2s..." state.
> > >
> > > I see the traffic going between the nodes in tcpdump. There is no
> > > firewall on the hosts.
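A filter like this should show the heartbeat traffic in both directions
(interface and port as in your ha.cf):

----8<---------------------------------------------------------------------------
tcpdump -ni eth0 udp port 695
----8<---------------------------------------------------------------------------

Worth running on *both* guests: seeing a packet leave drbd02 does not
guarantee it arrives at drbd01.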
> > >
> > > Both host names resolve properly.
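E.g. a quick check on both nodes:

----8<---------------------------------------------------------------------------
getent hosts drbd01.test drbd02.test
----8<---------------------------------------------------------------------------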
> > >
> > > What am I missing?
> >
> > Logs from the other node might allow us to see what the problem is.
>
> Here is the current ha.cf (I tweaked it a little since I asked the
> question, mostly switched to ucast and added a pingd respawn
> directive).
>
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
>
> keepalive 2
> deadtime 30
> warntime 10
> initdead 120
> udpport 695
> #bcast eth0
> ucast eth0 drbd01.test.spammatters.local
> ucast eth0 drbd02.test.spammatters.local
> respawn root /usr/lib64/heartbeat/pingd -m 100 -d 5s
> node drbd01.test.spammatters.local
> node drbd02.test.spammatters.local
> crm yes
>
> And here is the log from the node which doesn't join the cluster:
>
> heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695
> reserved for service "ieee-mms-ssl".
> heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used
> because crm is enabled
> heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled
> --enabling logging daemon is recommended
> heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated.
> Starting heartbeat 2.1.2
> heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler:
> Added signal manual handler
> heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler:
> Added signal manual handler
> heartbeat[17482]: 2007/11/29_07:12:40 info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send
> socket to device: eth0
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive
> socket to device: eth0
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on
> port 695 interface eth0 to 192.168.0.248
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send
> socket to device: eth0
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive
> socket to device: eth0
> heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on
> port 695 interface eth0 to 192.168.0.249
> heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> heartbeat[17482]: 2007/11/29_07:12:41 info: Link
> drbd01.test.spammatters.local:eth0 up.
> heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node
> drbd01.test.spammatters.local: status up
>
> One thing I noticed is that /var/lib/heartbeat/hostcache on drbd01
> (the "master", which doesn't see the slave) contains:
>
> drbd01.test.spammatters.local 580f615c-2921-4ba6-b03e-9a6e179d9ae2 100
> drbd02.test.spammatters.local 00000000-0000-0000-0000-000000000000 100
>
> while the one on drbd02 (the "slave", which doesn't join the cluster,
> as far as I understand the terminology) is:
>
> drbd01.test.spammatters.local 580f615c-2921-4ba6-b03e-9a6e179d9ae2 100
> drbd02.test.spammatters.local 9e412388-5551-46ae-84d2-a916ac05a8c1 100
>
> I tried removing it from drbd01 but it keeps coming back with the same
> content; is it perhaps rewritten by the resource manager?
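The hostcache is maintained by heartbeat itself, so it has to be stopped
before the file will stay gone. If you want to let the nodes renegotiate
from scratch, something like this on *both* nodes should do:

----8<---------------------------------------------------------------------------
service heartbeat stop
rm -f /var/lib/heartbeat/hostcache
service heartbeat start
----8<---------------------------------------------------------------------------

The all-zero UUID for drbd02 in drbd01's copy would fit a one-way
communication problem: drbd01 has apparently never received a status
packet from drbd02.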
>
> Here is the content of /var/lib/heartbeat/crm/cib.xml on drbd01:
>
> <cib generated="false" admin_epoch="0" epoch="0" num_updates="0"
>      have_quorum="true" ignore_dtd="false" num_peers="0"
>      cib-last-written="Thu Nov 29 07:14:41 2007" ccm_transition="1">
>   <configuration>
>     <crm_config/>
>     <nodes>
>       <node id="580f615c-2921-4ba6-b03e-9a6e179d9ae2"
>             uname="drbd01.test.spammatters.local" type="normal"/>
>     </nodes>
>     <resources/>
>     <constraints/>
>   </configuration>
> </cib>
>
> On drbd02 this directory is completely empty.
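For comparison, you could also query the live CIB on each node:

----8<---------------------------------------------------------------------------
cibadmin -Q
----8<---------------------------------------------------------------------------

On a healthy two-node cluster, both should list both <node> entries.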
It looks like a communication problem. Are you sure that the
udp/695 port is open?
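With heartbeat stopped on both nodes (so the port is free), a manual UDP
probe is a quick way to test the path; depending on your nc variant the
listener needs -l -p or just -l:

----8<---------------------------------------------------------------------------
# on drbd02:
nc -u -l -p 695
# on drbd01:
echo ping | nc -u drbd02.test.spammatters.local 695
----8<---------------------------------------------------------------------------

If "ping" never shows up on the listener, something between the guests
(iptables, or the Xen bridge setup) is eating the packets.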
Thanks,
Dejan
> Also, shutting down heartbeat on drbd02 ("service heartbeat stop")
> almost always hangs (90+% of the time).
>
> Thanks very much for your help.
>
> --Amos
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems