On 28/11/2007, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Nov 28, 2007, at 5:41 AM, Amos Shapira wrote:
>
> > Hello,
> >
> > I've been trying to follow the instructions in
> > http://wiki.centos.org/HowTos/Ha-Drbd to setup a basic DRBD test
> > cluster on top of a couple of Xen guests.
> >
> > In case this matters, both Xen guests and Xen Dom0 run CentOS 5 which
> > comes with Heartbeat 2.1.2-3 and DRBD 8.0.6.
> >
> > Another point: a colleague of mine is setting up a test on another
> > couple of guests, so I am careful not to use broadcast on the default
> > port (I either use a different port or use mcast).
> >
> > What I see is that the primary node comes up but never sees the
> > secondary one.
> >
> > Here is /etc/ha.d/ha.cf so far (no resources configured yet in the CRM):
> > ----8<---------------------------------------------------------------------------
> > keepalive 2
> > deadtime 30
> > warntime 10
> > initdead 120
> > udpport 695
> > bcast   eth0
> > node    drbd01.test
> > node    drbd02.test
> > crm yes
> > ----8<---------------------------------------------------------------------------
> >
> > And when I start heartbeat and run crm_mon, it shows (after about a
> > minute):
> > ----8<---------------------------------------------------------------------------
> > Refresh in 1s...
> >
> > ============
> > Last updated: Thu Nov 29 02:25:55 2007
> > Current DC: drbd01.test (580f615c-2921-4ba6-b03e-9a6e179d9ae2)
> > 1 Nodes configured.
> > 0 Resources configured.
> > ============
> >
> > Node: drbd01.test (580f615c-2921-4ba6-b03e-9a6e179d9ae2): online
> > ----8<---------------------------------------------------------------------------
> > The instructions in the above-mentioned link say that both nodes
> > should have been displayed (of course I copied the files over to the
> > other node and started heartbeat on it as well).
> >
> > The other node never gets out of a state of "Not connected: Refresh
> > in 2s..."
> >
> > I see the traffic going between the nodes in tcpdump. There is no
> > firewall on the hosts.
> >
> > Both host names resolve properly.
> >
> > What am I missing?
>
> Logs from the other node might allow us to see what the problem is.

Here is the current ha.cf (I tweaked it a little since I asked the
question, mostly switched to ucast and added a pingd respawn
directive).

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0

keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 695
#bcast   eth0
ucast   eth0 drbd01.test.spammatters.local
ucast   eth0 drbd02.test.spammatters.local
respawn root /usr/lib64/heartbeat/pingd -m 100 -d 5s
node    drbd01.test.spammatters.local
node    drbd02.test.spammatters.local
crm yes
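
As an aside, one thing I scripted for myself while staring at this config (my own throwaway sketch, not a heartbeat tool): every ucast destination in ha.cf should also appear as a node entry, since a typo in either place could explain one node never seeing the other.

```python
def check_ha_cf(text):
    """Return ucast destinations that lack a matching node entry.
    A quick eyeball-check of ha.cf; not part of heartbeat itself."""
    ucast_targets, nodes = set(), set()
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "ucast" and len(parts) >= 3:
            ucast_targets.add(parts[2])  # ucast <iface> <peer>
        elif parts[0] == "node" and len(parts) >= 2:
            nodes.add(parts[1])
    return ucast_targets - nodes

ha_cf = """\
udpport 695
ucast   eth0 drbd01.test.spammatters.local
ucast   eth0 drbd02.test.spammatters.local
node    drbd01.test.spammatters.local
node    drbd02.test.spammatters.local
crm yes
"""
print(check_ha_cf(ha_cf))  # empty set means the two directives agree
```

In my case it comes back empty, so at least the spelling is consistent on both sides.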

And here is the log from the node which doesn't join the cluster:

heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695
reserved for service "ieee-mms-ssl".
heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used
because crm is enabled
heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated.
Starting heartbeat 2.1.2
heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler:
Added signal manual handler
heartbeat[17482]: 2007/11/29_07:12:40 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send
socket to device: eth0
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive
socket to device: eth0
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on
port 695 interface eth0 to 192.168.0.248
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send
socket to device: eth0
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive
socket to device: eth0
heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on
port 695 interface eth0 to 192.168.0.249
heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler:
Added signal handler for signal 17
heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
heartbeat[17482]: 2007/11/29_07:12:41 info: Link
drbd01.test.spammatters.local:eth0 up.
heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node
drbd01.test.spammatters.local: status up

One thing I noticed is that /var/lib/heartbeat/hostcache on drbd01
(the "master", which doesn't see the slave) contains:

drbd01.test.spammatters.local   580f615c-2921-4ba6-b03e-9a6e179d9ae2    100
drbd02.test.spammatters.local   00000000-0000-0000-0000-000000000000    100

while the one on drbd02 (the "slave", which doesn't join the cluster,
as far as I understand the terminology) contains:

drbd01.test.spammatters.local   580f615c-2921-4ba6-b03e-9a6e179d9ae2    100
drbd02.test.spammatters.local   9e412388-5551-46ae-84d2-a916ac05a8c1    100

I tried removing the file from drbd01, but it reappears with the same
content; presumably the cluster manager rewrites it?
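
The all-zeros line in drbd01's hostcache is what caught my eye; my assumption (which may be wrong) is that the nil UUID means drbd01 has never completed a handshake with drbd02 and so never learned its real UUID. A small sketch of my own for spotting such entries:

```python
NIL_UUID = "00000000-0000-0000-0000-000000000000"

def unknown_peers(hostcache_text):
    """Return node names whose cached UUID is still the nil UUID,
    i.e. nodes whose identity this host has never learned.
    (My own reading of the hostcache format: name, UUID, weight.)"""
    bad = []
    for line in hostcache_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == NIL_UUID:
            bad.append(fields[0])
    return bad

drbd01_cache = """\
drbd01.test.spammatters.local   580f615c-2921-4ba6-b03e-9a6e179d9ae2    100
drbd02.test.spammatters.local   00000000-0000-0000-0000-000000000000    100
"""
print(unknown_peers(drbd01_cache))  # -> ['drbd02.test.spammatters.local']
```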

Here is the content of /var/lib/heartbeat/crm/cib.xml on drbd01:

 <cib generated="false" admin_epoch="0" epoch="0" num_updates="0"
have_quorum="true" ignore_dtd="false" num_peers="0"
cib-last-written="Thu Nov 29 07:14:41 2007" ccm_transition="1">
   <configuration>
     <crm_config/>
     <nodes>
       <node id="580f615c-2921-4ba6-b03e-9a6e179d9ae2"
uname="drbd01.test.spammatters.local" type="normal"/>
     </nodes>
     <resources/>
     <constraints/>
   </configuration>
 </cib>

On drbd02, this entire directory is empty.
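
To double-check I'm reading the CIB right, here's a small standard-library sketch (mine, not a cluster tool) that pulls the registered node names out of a cib.xml like the one above; it confirms only drbd01 ever made it into the nodes section:

```python
import xml.etree.ElementTree as ET

cib_xml = """\
<cib generated="false" admin_epoch="0" epoch="0" num_updates="0"
 have_quorum="true" ignore_dtd="false" num_peers="0"
 cib-last-written="Thu Nov 29 07:14:41 2007" ccm_transition="1">
  <configuration>
    <crm_config/>
    <nodes>
      <node id="580f615c-2921-4ba6-b03e-9a6e179d9ae2"
       uname="drbd01.test.spammatters.local" type="normal"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
</cib>
"""

root = ET.fromstring(cib_xml)
# Collect the uname attribute of every <node> element in the CIB.
unames = [n.get("uname") for n in root.iter("node")]
print(unames)  # -> ['drbd01.test.spammatters.local']
```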

Also, shutting down heartbeat on drbd02 ("service heartbeat stop")
hangs almost every time (90+% of the time).

Thanks very much for your help.

--Amos
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
