On Fri, Apr 16, 2010 at 03:53:11PM -0500, Patrick Cotner wrote:
>  
> 
> > On 04/16/2010 07:04 PM, Patrick Cotner wrote:
> > > cl_status hblinkstatus san02 eth1 reports 'dead' from san01.  
> > > The interface is up and I can ping from either side.
> > > I can't figure out why heartbeat thinks this interface is dead and I'm
> > > not sure what I need to do next in order to resolve it.
> > > 
> > > Basic setup:
> > > Heartbeat 3.0.2 on Debian lenny, two nodes: san01, san02
> > > 
> > > +--------+     +--------+
> > > |  eth1  |<--->|  eth1  |  
> > > | san01  |     | san02  |
> > > |  eth2  |<--->|  eth2  |
> > > +--------+     +--------+  
> > > 
> > > /etc/ha.d/ha.cf:
> > > use_logd on
> > > debug 1
> > > autojoin none
> > > bcast eth1
> > > bcast eth2
> > > initdead 30
> > > keepalive 1
> > > warntime 5
> > > deadtime 10
> > > node san01
> > > node san02
> > > crm yes
> > > 
> > > cl_status results when issued from san01:
> > > cl_status hbstatus   >> Heartbeat is running on this machine.
> > > cl_status listnodes  >>  san02 san01  
> > > cl_status listhblinks san01  >> eth2 eth1
> > > cl_status listhblinks san02  >> eth2 eth1
> > > cl_status hblinkstatus san01 eth1  >> up
> > > cl_status hblinkstatus san01 eth2  >> up
> > > cl_status hblinkstatus san02 eth1  >> dead <<  this is the problem
> > > cl_status hblinkstatus san02 eth2  >> up
> > > 
> > > cl_status results when issued from san02:
> > > cl_status hbstatus   >> Heartbeat is running on this machine.
> > > cl_status listnodes  >>  san02 san01  
> > > cl_status listhblinks san01  >> eth2 eth1
> > > cl_status listhblinks san02  >> eth2 eth1
> > > cl_status hblinkstatus san01 eth1  >> up
> > > cl_status hblinkstatus san01 eth2  >> up
> > > cl_status hblinkstatus san02 eth1  >> up
> > > cl_status hblinkstatus san02 eth2  >> up
> > > 
> > > Can anyone give me any other avenues to troubleshoot?
> > > Let me know if I need to provide any more information regarding my
> > > setup.
> > 
> > Typical cause would be a local firewall blocking incoming UDP port 694
> > on eth1 on san01 only.
> > 
> > Cheers,
> > Florian
> > 
> 
> Florian, thanks for the reply.
> I don't think that's it as my iptables are completely empty on both
> nodes:
> 
> san01:~# iptables -L

Well, I'd always check with iptables-save,
as iptables -L only lists one table (filter, by default);
you may have something strange in the mangle or nat tables
which would be overlooked.
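
A minimal sketch of that check, assuming bash and the usual iptables-save
text format (a `*table` header per table, `:CHAIN POLICY` lines, then
`-A`/`-I` rules). The helper ha_rules is hypothetical, not part of iptables
or heartbeat: it prints each table header, any chain whose policy is DROP,
and every rule that mentions port 694 or one of the named interfaces.
It is a rough grep, not a real rule parser.

```shell
# ha_rules: hypothetical helper (not part of iptables or heartbeat).
# Reads `iptables-save` output on stdin and prints:
#   - table headers ("*filter", "*nat", "*mangle", ...)
#   - built-in chains whose policy is DROP
#   - -A/-I rules mentioning 694 as a whole number, or "-i IFACE"/"-o IFACE"
ha_rules() {
  local alts='(^|[^0-9])694([^0-9]|$)'
  local iface
  for iface in "$@"; do
    # "-i eth1" or "-o eth1", followed by a space or end of line
    alts="$alts|-[io] $iface( |\$)"
  done
  grep -E -- "^\*|^:[^ ]+ DROP|^-[AI] .*($alts)"
}
```

Run it as root on each node, e.g.
# iptables-save | ha_rules eth1 eth2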

But ok.
Let's assume there is no iptables rule in the way.

> Are there any atypical causes?

Is there anything in the logs?
It should have logged it when the link went "down".
What else happened at that time
 within the cluster subsystem,
 within the systems,
 within the network?
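
A sketch for pulling the relevant entries out of the log. It assumes
heartbeat words its link-status messages as "Link <node>:<iface> up." and
"Link <node>:<iface> dead.", and that logd writes them wherever logd.cf
points (often syslog); both are assumptions to verify on your system.
link_events is a hypothetical helper.

```shell
# link_events: hypothetical helper.  Reads log lines on stdin and prints
# those recording a status change of NODE:IFACE, for example
#   ... heartbeat: [2345]: WARN: Link san02:eth1 dead.
# The message wording is an assumption; adjust the pattern if yours differs.
link_events() {
  local node=$1 iface=$2
  # the trailing " (up|dead)" keeps eth1 from also matching eth10
  grep -E -- "Link ${node}:${iface} (up|dead)"
}
```

e.g. on san01, assuming the messages end up in /var/log/syslog:
# link_events san02 eth1 < /var/log/syslog
The timestamps then tell you where to look in the other logs.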

As you use "bcast" statements,
are you sure both nodes agree on the broadcast address?
# ip addr show dev eth1; ip route show dev eth1
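
To compare the nodes side by side, a minimal sketch that prints the IPv4
broadcast address(es) from iproute2's one-line output. It assumes the
`ip -o -4 addr show` format, where the broadcast address follows the
`brd` keyword; brd_of is a hypothetical helper.

```shell
# brd_of: hypothetical helper.  Reads `ip -o -4 addr show dev IFACE` output
# on stdin and prints every address that follows the "brd" keyword,
# one per line.
brd_of() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "brd") print $(i + 1) }'
}
```

Run
# ip -o -4 addr show dev eth1 | brd_of
on both san01 and san02; the results should be identical (likewise for eth2).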

Also,
# ps faux | grep heartbeat
should show you, for each bcast interface, something like
 heartbeat: write: bcast eth1
 heartbeat: read: bcast eth1
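
With two bcast interfaces, that means a read and a write child for each of
eth1 and eth2. A minimal sketch of that check, reading ps output on stdin
and matching the process titles quoted above; hb_children is a
hypothetical helper.

```shell
# hb_children: hypothetical helper.  Reads `ps` output on stdin and reports,
# for each interface given, whether the heartbeat "read" and "write"
# children for "bcast IFACE" are present.  Returns 1 if any is missing.
hb_children() {
  local ps_out iface dir rc=0
  ps_out=$(cat)
  for iface in "$@"; do
    for dir in read write; do
      # the title is the last thing on the line, hence the "$" anchor
      if printf '%s\n' "$ps_out" | grep -q -- "heartbeat: $dir: bcast $iface\$"; then
        echo "$iface $dir: present"
      else
        echo "$iface $dir: MISSING"
        rc=1
      fi
    done
  done
  return "$rc"
}
```

e.g.
# ps -eo args | hb_children eth1 eth2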

You could strace them to find out what they are doing.
If you feel like it, you can even kill them (they will be restarted).
Just don't kill all of them at the same time,
and don't kill the master control process,
or you won't be happy with the results.
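
To strace one of the children you first need its PID. A sketch, assuming
the process titles shown above: the hypothetical helper hb_pid picks the
PID of one read or write child out of `ps -eo pid,args` output. It matches
the whole title exactly, so the master control process is never selected.

```shell
# hb_pid: hypothetical helper.  Reads `ps -eo pid,args` output on stdin and
# prints the PID of the child titled exactly "heartbeat: DIR: bcast IFACE",
# where DIR is "read" or "write".
hb_pid() {
  local dir=$1 iface=$2
  awk -v want="heartbeat: $dir: bcast $iface" '
    {
      pid = $1
      sub(/^[ \t]*[0-9]+[ \t]+/, "")   # drop the PID column, keep the args
      if ($0 == want) print pid
    }'
}
```

On san01, the read child for eth1 is presumably the one that should be
receiving san02's packets:
# strace -p "$(ps -eo pid,args | hb_pid read eth1)"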

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
