Hi all,

Recently I encountered a problem in our production environment. I googled for a long while but failed to find an answer. Please help me.

Here is our configuration: we have two nodes (d02 and d03, CentOS 6.4, heartbeat-3.0.4-1.el6.x86_64) set up as heartbeat peers. Each node has six GigE interfaces, bonded in pairs, which gives us three links: bond0, bond1 and bond2. bond0 and bond1 use bonding mode 6 (balance-alb), and bond2 is configured as bonding mode 0 (balance-rr). All interfaces are connected through a switch.

Below are the heartbeat configurations:

d02: ha.cf
logfacility local7
keepalive 2
deadtime 30
initdead 120
node d02 d03
ucast bond0 10.1.205.3
ucast bond1 172.1.1.3
ucast bond2 192.168.128.3
auto_failback off
respawn root /usr/lib64/heartbeat/dopd
apiauth dopd uid=root gid=root
ping_group mdsha 10.1.205.254
respawn root /usr/lib64/heartbeat/ipfail
apiauth ipfail uid=root gid=root

d03: ha.cf
logfacility local7
keepalive 2
deadtime 30
initdead 120
node d02 d03
ucast bond0 10.1.205.2
ucast bond1 172.1.1.2
ucast bond2 192.168.138.2
auto_failback off
respawn root /usr/lib64/heartbeat/dopd
apiauth dopd uid=root gid=root
ping_group mdsha 10.1.205.254
respawn root /usr/lib64/heartbeat/ipfail
apiauth ipfail uid=root gid=root

haresources:
d02 IPaddr::172.1.1.100/24/bond1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd mdsinit
# mdsinit is used to start our applications.

The problem is: this environment ran without any problem during the past year, with d02 as the master. Yesterday, we had to switch the master to d03, so we stopped the heartbeat service on d02, and the master failed over to d03 as we expected. Our application came up and everything seemed OK. But when I checked the status of the three heartbeat links from d03 with the commands below, I got this:

[root@d03 ~]# cl_status hblinkstatus d02 bond0
up
[root@d03 ~]# cl_status hblinkstatus d02 bond1
up
[root@d03 ~]# cl_status hblinkstatus d02 bond2
dead

I tried the same commands on d02, and all links were reported as up.
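For reference, the three per-link queries above can be wrapped in a small loop so the whole picture comes from one command. This is only a sketch: it assumes heartbeat's cl_status utility is in the PATH on the node you run it from, and it falls back to "unknown" if the query fails:

```shell
#!/bin/sh
# Report the status of every heartbeat link to the peer node d02,
# as seen from the local node. cl_status ships with heartbeat 3.x.
peer=d02
results=""
for link in bond0 bond1 bond2; do
    # Fall back to "unknown" if cl_status is unavailable or errors out.
    status=$(cl_status hblinkstatus "$peer" "$link" 2>/dev/null || echo unknown)
    results="$results $peer/$link=$status"
    echo "$peer/$link: $status"
done
```

If bond2 stays dead on one side only, it may also be worth confirming on each node that heartbeat's UDP packets (port 694, per the logs below) actually arrive on bond2, e.g. with `tcpdump -ni bond2 udp port 694`, and checking the iptables rule set for anything dropping that port.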
Since the environment ran correctly before the failover, I didn't think it was a network problem, but I checked the network anyway and found that the iptables service on d02 was enabled, and we could not reach d02 from d03 through bond2. OK, that might be the problem. I stopped the iptables service and tried again, but nothing changed: bond2 was still dead. After that, I also restarted the heartbeat service on d02 (which was not the master at that time) and nothing changed. (I was not allowed to restart the heartbeat service on d03, since the application was in service.) There is also nothing strange in the system log (attached below).

So, could you please tell me what I can do to solve this problem without affecting the heartbeat service on d03? If you need more information, please don't hesitate to let me know; I will reply as soon as possible.

Thanks.
Hu

d03:/var/log/messages

Sep 30 12:00:50 d03 heartbeat: [3628]: info: d03 Heartbeat shutdown complete.
Sep 30 12:01:04 d03 heartbeat: [15767]: info: Pacemaker support: false
Sep 30 12:01:04 d03 heartbeat: [15767]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Sep 30 12:01:04 d03 heartbeat: [15767]: info: **************************
Sep 30 12:01:04 d03 heartbeat: [15767]: info: Configuration validated.
Starting heartbeat 3.0.4
Sep 30 12:01:04 d03 heartbeat: [15768]: info: heartbeat: version 3.0.4
Sep 30 12:01:04 d03 heartbeat: [15768]: info: Heartbeat generation: 1372996321
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound send socket to device: bond0
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound receive socket to device: bond0
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: started on port 694 interface bond0 to 10.1.205.2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond1
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound send socket to device: bond1
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound receive socket to device: bond1
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: started on port 694 interface bond1 to 172.1.1.2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound send socket to device: bond2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound receive socket to device: bond2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: started on port 694 interface bond2 to 192.168.138.2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ping group heartbeat started.
Sep 30 12:01:04 d03 heartbeat: [15768]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 12:01:04 d03 heartbeat: [15768]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 12:01:04 d03 heartbeat: [15768]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 30 12:01:04 d03 heartbeat: [15768]: info: Local status now set to: 'up'
Sep 30 12:01:05 d03 heartbeat: [15768]: info: Link mdsha:mdsha up.
Sep 30 12:01:05 d03 heartbeat: [15768]: info: Status update for node mdsha: status ping
Sep 30 12:01:06 d03 kernel: block drbd0: peer( Primary -> Secondary )
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Link d02:bond0 up.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Status update for node d02: status active
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Link d02:bond1 up.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Received shutdown notice from 'd02'.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Resources being acquired from d02.
Sep 30 12:01:06 d03 heartbeat: [15783]: info: acquire all HA resources (standby).
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Comm_now_up(): updating status to active
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Local status now set to: 'active'
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Starting child client "/usr/lib64/heartbeat/dopd" (0,0)
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Starting child client "/usr/lib64/heartbeat/ipfail" (0,0)
Sep 30 12:01:06 d03 heartbeat: [15788]: info: Starting "/usr/lib64/heartbeat/dopd" as uid 0 gid 0 (pid 15788)
Sep 30 12:01:06 d03 heartbeat: [15789]: info: Starting "/usr/lib64/heartbeat/ipfail" as uid 0 gid 0 (pid 15789)
Sep 30 12:01:06 d03 heartbeat: [15784]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys d03] to acquire.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Initial resource acquisition complete (T_RESOURCES(us))
Sep 30 12:01:06 d03 harc(default)[15782]: info: Running /etc/ha.d//rc.d/status status
Sep 30 12:01:06 d03 ResourceManager(default)[15820]: info: Acquiring resource group: d02 IPaddr::172.1.1.100/24/bond1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd mds
Sep 30 12:01:06 d03 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.1.1.100)[15858]: INFO: Resource is stopped
Sep 30 12:01:06 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/IPaddr 172.1.1.100/24/bond1 start
Sep 30 12:01:06 d03 IPaddr(IPaddr_172.1.1.100)[15943]: INFO: Using calculated netmask for 172.1.1.100: 255.255.255.0
Sep 30 12:01:07 d03 IPaddr(IPaddr_172.1.1.100)[15943]: INFO: eval ifconfig bond1:0 172.1.1.100 netmask 255.255.255.0 broadcast 172.1.1.255
Sep 30 12:01:07 d03 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.1.1.100)[15917]: INFO: Success
Sep 30 12:01:07 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/drbddisk r0 start
Sep 30 12:01:07 d03 kernel: block drbd0: role( Secondary -> Primary )
Sep 30 12:01:07 d03 /usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[16079]: INFO: Resource is stopped
Sep 30 12:01:07 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/drbd start
Sep 30 12:01:07 d03 Filesystem(Filesystem_/dev/drbd0)[16159]: INFO: Running start for /dev/drbd0 on /mnt/drbd
Sep 30 12:01:07 d03 Filesystem(Filesystem_/dev/drbd0)[16159]: INFO: Starting filesystem check on /dev/drbd0
Sep 30 12:01:08 d03 ntpd[2582]: Listening on interface #12 bond1:0, 172.1.1.100#123 Enabled
Sep 30 12:01:09 d03 kernel: EXT4-fs (drbd0): mounted filesystem with ordered data mode.
Opts:
Sep 30 12:01:09 d03 /usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[16150]: INFO: Success
Sep 30 12:01:09 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/mds start
Sep 30 12:01:10 d03 heartbeat: [15783]: info: all HA resource acquisition completed (standby).
Sep 30 12:01:10 d03 heartbeat: [15768]: info: Standby resource acquisition done [all].
Sep 30 12:01:10 d03 harc(default)[16324]: info: Running /etc/ha.d//rc.d/status status
Sep 30 12:01:11 d03 ipfail: [15789]: info: Ping node count is balanced.
Sep 30 12:01:12 d03 mach_down(default)[16341]: info: Taking over resource group IPaddr::172.1.1.100/24/bond1
Sep 30 12:01:12 d03 ResourceManager(default)[16368]: info: Acquiring resource group: d02 IPaddr::172.1.1.100/24/bond1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd mds
Sep 30 12:01:12 d03 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.1.1.100)[16396]: INFO: Running OK
Sep 30 12:01:12 d03 /usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[16468]: INFO: Running OK
Sep 30 12:01:12 d03 mach_down(default)[16341]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Sep 30 12:01:12 d03 mach_down(default)[16341]: info: mach_down takeover complete for node d02.
Sep 30 12:01:12 d03 heartbeat: [15768]: info: mach_down takeover complete.
Sep 30 12:01:17 d03 heartbeat: [15768]: info: Local Resource acquisition completed. (none)
Sep 30 12:01:17 d03 heartbeat: [15768]: info: local resource transition completed.
Sep 30 12:01:39 d03 heartbeat: [15768]: WARN: node d02: is dead
Sep 30 12:01:39 d03 heartbeat: [15768]: info: Dead node d02 gave up resources.
Sep 30 12:01:39 d03 ipfail: [15789]: info: Status update: Node d02 now has status dead
Sep 30 12:01:39 d03 heartbeat: [15768]: info: Link d02:bond0 dead.
Sep 30 12:01:39 d03 heartbeat: [15768]: info: Link d02:bond1 dead.
Sep 30 12:01:40 d03 ipfail: [15789]: info: NS: We are still alive!
Sep 30 12:01:40 d03 ipfail: [15789]: info: Link Status update: Link d02/bond0 now has status dead
Sep 30 12:01:42 d03 ipfail: [15789]: info: Asking other side for ping node count.
Sep 30 12:01:42 d03 ipfail: [15789]: info: Checking remote count of ping nodes.
Sep 30 12:01:42 d03 ipfail: [15789]: info: Link Status update: Link d02/bond1 now has status dead
Sep 30 12:01:43 d03 ipfail: [15789]: info: Asking other side for ping node count.
Sep 30 12:01:43 d03 ipfail: [15789]: info: Checking remote count of ping nodes.
Sep 30 12:15:21 d03 smbd[17096]: [2014/09/30 12:15:21.942313, 0] smbd/process.c:2440(keepalive_fn)
Sep 30 12:15:21 d03 smbd[17096]: send_keepalive failed for client 0.0.0.0. Error Broken pipe - exiting
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Heartbeat restart on node d02
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Link d02:bond0 up.
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Status update for node d02: status init
Sep 30 13:13:16 d03 ipfail: [15789]: info: Link Status update: Link d02/bond0 now has status up
Sep 30 13:13:16 d03 ipfail: [15789]: info: Status update: Node d02 now has status init
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Link d02:bond1 up.
Sep 30 13:13:16 d03 ipfail: [15789]: info: Link Status update: Link d02/bond1 now has status up
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Status update for node d02: status up
Sep 30 13:13:16 d03 ipfail: [15789]: info: Status update: Node d02 now has status up
Sep 30 13:13:16 d03 harc(default)[29476]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:16 d03 harc(default)[29493]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:17 d03 heartbeat: [15768]: info: all clients are now paused
Sep 30 13:13:18 d03 heartbeat: [15768]: info: Status update for node d02: status active
Sep 30 13:13:18 d03 ipfail: [15789]: info: Status update: Node d02 now has status active
Sep 30 13:13:18 d03 harc(default)[29510]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:19 d03 heartbeat: [15768]: info: remote resource transition completed.
Sep 30 13:13:25 d03 heartbeat: [15768]: info: all clients are now resumed
Sep 30 13:13:27 d03 ipfail: [15789]: info: Asking other side for ping node count.
Sep 30 13:13:29 d03 ipfail: [15789]: info: No giveup timer to abort.

d02:/var/log/messages

Sep 30 13:13:12 d02 heartbeat: [18371]: info: Pacemaker support: false
Sep 30 13:13:12 d02 heartbeat: [18371]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Sep 30 13:13:12 d02 heartbeat: [18371]: info: **************************
Sep 30 13:13:12 d02 heartbeat: [18371]: info: Configuration validated.
Starting heartbeat 3.0.4
Sep 30 13:13:12 d02 heartbeat: [18372]: info: heartbeat: version 3.0.4
Sep 30 13:13:12 d02 heartbeat: [18372]: info: Heartbeat generation: 1372996313
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound send socket to device: bond0
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound receive socket to device: bond0
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: started on port 694 interface bond0 to 10.1.205.3
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond1
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound send socket to device: bond1
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound receive socket to device: bond1
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: started on port 694 interface bond1 to 172.1.1.3
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond2
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound send socket to device: bond2
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound receive socket to device: bond2
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: started on port 694 interface bond2 to 192.168.128.3
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ping group heartbeat started.
Sep 30 13:13:12 d02 heartbeat: [18372]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 13:13:12 d02 heartbeat: [18372]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 13:13:12 d02 heartbeat: [18372]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 30 13:13:12 d02 heartbeat: [18372]: info: Local status now set to: 'up'
Sep 30 13:13:13 d02 heartbeat: [18372]: info: Link mdsha:mdsha up.
Sep 30 13:13:13 d02 heartbeat: [18372]: info: Status update for node mdsha: status ping
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Link d03:bond0 up.
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Status update for node d03: status active
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Link d03:bond1 up.
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Link d03:bond2 up.
Sep 30 13:13:14 d02 harc(default)[18387]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Comm_now_up(): updating status to active
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Local status now set to: 'active'
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Starting child client "/usr/lib64/heartbeat/dopd" (0,0)
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Starting child client "/usr/lib64/heartbeat/ipfail" (0,0)
Sep 30 13:13:14 d02 heartbeat: [18405]: info: Starting "/usr/lib64/heartbeat/dopd" as uid 0 gid 0 (pid 18405)
Sep 30 13:13:14 d02 heartbeat: [18406]: info: Starting "/usr/lib64/heartbeat/ipfail" as uid 0 gid 0 (pid 18406)
Sep 30 13:13:15 d02 heartbeat: [18372]: info: remote resource transition completed.
Sep 30 13:13:15 d02 heartbeat: [18372]: info: remote resource transition completed.
Sep 30 13:13:15 d02 heartbeat: [18372]: info: Local Resource acquisition completed. (none)
Sep 30 13:13:15 d02 heartbeat: [18372]: info: Initial resource acquisition complete (T_RESOURCES(them))
Sep 30 13:13:25 d02 ipfail: [18406]: info: Ping node count is balanced.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
