Hi all,

Recently I encountered a problem in our production environment. I googled for a long while but failed to find an answer. Please help me.

Here is our configuration: we have two nodes (d02 and d03, CentOS 6.4, heartbeat-3.0.4-1.el6.x86_64) set up as heartbeat peers. Each node has six GigE interfaces, bonded in pairs, which gives us three links: bond0, bond1 and bond2. bond0 and bond1 use bonding mode 6 (balance-alb), and bond2 is configured as bonding mode 0 (balance-rr). All interfaces are connected through a switch.

Below are the heartbeat configurations:

d02: ha.cf
logfacility local7
keepalive 2
deadtime 30
initdead 120
node d02 d03
ucast bond0 10.1.205.3
ucast bond1 172.1.1.3
ucast bond2 192.168.128.3
auto_failback off
respawn root /usr/lib64/heartbeat/dopd
apiauth dopd uid=root gid=root
ping_group mdsha 10.1.205.254
respawn root /usr/lib64/heartbeat/ipfail
apiauth ipfail uid=root gid=root

d03: ha.cf
logfacility local7
keepalive 2
deadtime 30
initdead 120
node d02 d03
ucast bond0 10.1.205.2
ucast bond1 172.1.1.2
ucast bond2 192.168.138.2
auto_failback off
respawn root /usr/lib64/heartbeat/dopd
apiauth dopd uid=root gid=root
ping_group mdsha 10.1.205.254
respawn root /usr/lib64/heartbeat/ipfail
apiauth ipfail uid=root gid=root

haresources:
d02 IPaddr::172.1.1.100/24/bond1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd mdsinit
# mdsinit is used to start our applications.

The problem is: this environment ran without any problem during the past year, with d02 as the master. Yesterday, we had to switch the master to d03, so we stopped the heartbeat service on d02, and the master failed over to d03 as we expected. Our application came up and everything seemed OK. But when I checked the status of the three heartbeat links from d03 with the commands below, I got this:

[root@d03 ~]# cl_status hblinkstatus d02 bond0
up
[root@d03 ~]# cl_status hblinkstatus d02 bond1
up
[root@d03 ~]# cl_status hblinkstatus d02 bond2
dead

I tried the same commands on d02, and all links were reported as up.
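For reference, the three per-link queries above can be wrapped in a small loop so the whole picture comes from one command. This is only a sketch: it assumes heartbeat's cl_status utility is in the PATH on the node you run it from, and it falls back to "unknown" if the query fails:

```shell
#!/bin/sh
# Report the status of every heartbeat link to the peer node d02,
# as seen from the local node. cl_status ships with heartbeat 3.x.
peer=d02
results=""
for link in bond0 bond1 bond2; do
    # Fall back to "unknown" if cl_status is unavailable or errors out.
    status=$(cl_status hblinkstatus "$peer" "$link" 2>/dev/null || echo unknown)
    results="$results $peer/$link=$status"
    echo "$peer/$link: $status"
done
```

If bond2 stays dead on one side only, it may also be worth confirming on each node that heartbeat's UDP packets (port 694, per the logs below) actually arrive on bond2, e.g. with `tcpdump -ni bond2 udp port 694`, and checking the iptables rule set for anything dropping that port.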
Since the environment ran correctly before the failover, I didn't think it was a network problem, but I checked the network anyway and found that the iptables service on d02 was enabled, and we could not reach d02 from d03 through bond2. OK, that might be the problem. I stopped the iptables service and tried again, but nothing changed: bond2 was still dead. After that, I also restarted the heartbeat service on d02 (which was not the master at that time) and nothing changed. (I was not allowed to restart the heartbeat service on d03, since the application was in service.) There is also nothing strange in the system log (attached below).

So, could you please tell me what I can do to solve this problem without affecting the heartbeat service on d03? If you need more information, please don't hesitate to let me know; I will reply as soon as possible.

Thanks.
Hu

d03:/var/log/messages

Sep 30 12:00:50 d03 heartbeat: [3628]: info: d03 Heartbeat shutdown complete.
Sep 30 12:01:04 d03 heartbeat: [15767]: info: Pacemaker support: false
Sep 30 12:01:04 d03 heartbeat: [15767]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Sep 30 12:01:04 d03 heartbeat: [15767]: info: **************************
Sep 30 12:01:04 d03 heartbeat: [15767]: info: Configuration validated.
Starting heartbeat 3.0.4
Sep 30 12:01:04 d03 heartbeat: [15768]: info: heartbeat: version 3.0.4
Sep 30 12:01:04 d03 heartbeat: [15768]: info: Heartbeat generation: 1372996321
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound send socket to device: bond0
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound receive socket to device: bond0
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: started on port 694 interface bond0 to 10.1.205.2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond1
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound send socket to device: bond1
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound receive socket to device: bond1
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: started on port 694 interface bond1 to 172.1.1.2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound send socket to device: bond2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: bound receive socket to device: bond2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ucast: started on port 694 interface bond2 to 192.168.138.2
Sep 30 12:01:04 d03 heartbeat: [15768]: info: glib: ping group heartbeat started.
Sep 30 12:01:04 d03 heartbeat: [15768]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 12:01:04 d03 heartbeat: [15768]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 12:01:04 d03 heartbeat: [15768]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 30 12:01:04 d03 heartbeat: [15768]: info: Local status now set to: 'up'
Sep 30 12:01:05 d03 heartbeat: [15768]: info: Link mdsha:mdsha up.
Sep 30 12:01:05 d03 heartbeat: [15768]: info: Status update for node mdsha: status ping
Sep 30 12:01:06 d03 kernel: block drbd0: peer( Primary -> Secondary )
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Link d02:bond0 up.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Status update for node d02: status active
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Link d02:bond1 up.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Received shutdown notice from 'd02'.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Resources being acquired from d02.
Sep 30 12:01:06 d03 heartbeat: [15783]: info: acquire all HA resources (standby).
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Comm_now_up(): updating status to active
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Local status now set to: 'active'
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Starting child client "/usr/lib64/heartbeat/dopd" (0,0)
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Starting child client "/usr/lib64/heartbeat/ipfail" (0,0)
Sep 30 12:01:06 d03 heartbeat: [15788]: info: Starting "/usr/lib64/heartbeat/dopd" as uid 0 gid 0 (pid 15788)
Sep 30 12:01:06 d03 heartbeat: [15789]: info: Starting "/usr/lib64/heartbeat/ipfail" as uid 0 gid 0 (pid 15789)
Sep 30 12:01:06 d03 heartbeat: [15784]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys d03] to acquire.
Sep 30 12:01:06 d03 heartbeat: [15768]: info: Initial resource acquisition complete (T_RESOURCES(us))
Sep 30 12:01:06 d03 harc(default)[15782]: info: Running /etc/ha.d//rc.d/status status
Sep 30 12:01:06 d03 ResourceManager(default)[15820]: info: Acquiring resource group: d02 IPaddr::172.1.1.100/24/bond1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd mds
Sep 30 12:01:06 d03 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.1.1.100)[15858]: INFO: Resource is stopped
Sep 30 12:01:06 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/IPaddr 172.1.1.100/24/bond1 start
Sep 30 12:01:06 d03 IPaddr(IPaddr_172.1.1.100)[15943]: INFO: Using calculated netmask for 172.1.1.100: 255.255.255.0
Sep 30 12:01:07 d03 IPaddr(IPaddr_172.1.1.100)[15943]: INFO: eval ifconfig bond1:0 172.1.1.100 netmask 255.255.255.0 broadcast 172.1.1.255
Sep 30 12:01:07 d03 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.1.1.100)[15917]: INFO: Success
Sep 30 12:01:07 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/drbddisk r0 start
Sep 30 12:01:07 d03 kernel: block drbd0: role( Secondary -> Primary )
Sep 30 12:01:07 d03 /usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[16079]: INFO: Resource is stopped
Sep 30 12:01:07 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/drbd start
Sep 30 12:01:07 d03 Filesystem(Filesystem_/dev/drbd0)[16159]: INFO: Running start for /dev/drbd0 on /mnt/drbd
Sep 30 12:01:07 d03 Filesystem(Filesystem_/dev/drbd0)[16159]: INFO: Starting filesystem check on /dev/drbd0
Sep 30 12:01:08 d03 ntpd[2582]: Listening on interface #12 bond1:0, 172.1.1.100#123 Enabled
Sep 30 12:01:09 d03 kernel: EXT4-fs (drbd0): mounted filesystem with ordered data mode.
Opts:
Sep 30 12:01:09 d03 /usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[16150]: INFO: Success
Sep 30 12:01:09 d03 ResourceManager(default)[15820]: info: Running /etc/ha.d/resource.d/mds start
Sep 30 12:01:10 d03 heartbeat: [15783]: info: all HA resource acquisition completed (standby).
Sep 30 12:01:10 d03 heartbeat: [15768]: info: Standby resource acquisition done [all].
Sep 30 12:01:10 d03 harc(default)[16324]: info: Running /etc/ha.d//rc.d/status status
Sep 30 12:01:11 d03 ipfail: [15789]: info: Ping node count is balanced.
Sep 30 12:01:12 d03 mach_down(default)[16341]: info: Taking over resource group IPaddr::172.1.1.100/24/bond1
Sep 30 12:01:12 d03 ResourceManager(default)[16368]: info: Acquiring resource group: d02 IPaddr::172.1.1.100/24/bond1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/drbd mds
Sep 30 12:01:12 d03 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.1.1.100)[16396]: INFO: Running OK
Sep 30 12:01:12 d03 /usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd0)[16468]: INFO: Running OK
Sep 30 12:01:12 d03 mach_down(default)[16341]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Sep 30 12:01:12 d03 mach_down(default)[16341]: info: mach_down takeover complete for node d02.
Sep 30 12:01:12 d03 heartbeat: [15768]: info: mach_down takeover complete.
Sep 30 12:01:17 d03 heartbeat: [15768]: info: Local Resource acquisition completed. (none)
Sep 30 12:01:17 d03 heartbeat: [15768]: info: local resource transition completed.
Sep 30 12:01:39 d03 heartbeat: [15768]: WARN: node d02: is dead
Sep 30 12:01:39 d03 heartbeat: [15768]: info: Dead node d02 gave up resources.
Sep 30 12:01:39 d03 ipfail: [15789]: info: Status update: Node d02 now has status dead
Sep 30 12:01:39 d03 heartbeat: [15768]: info: Link d02:bond0 dead.
Sep 30 12:01:39 d03 heartbeat: [15768]: info: Link d02:bond1 dead.
Sep 30 12:01:40 d03 ipfail: [15789]: info: NS: We are still alive!
Sep 30 12:01:40 d03 ipfail: [15789]: info: Link Status update: Link d02/bond0 now has status dead
Sep 30 12:01:42 d03 ipfail: [15789]: info: Asking other side for ping node count.
Sep 30 12:01:42 d03 ipfail: [15789]: info: Checking remote count of ping nodes.
Sep 30 12:01:42 d03 ipfail: [15789]: info: Link Status update: Link d02/bond1 now has status dead
Sep 30 12:01:43 d03 ipfail: [15789]: info: Asking other side for ping node count.
Sep 30 12:01:43 d03 ipfail: [15789]: info: Checking remote count of ping nodes.
Sep 30 12:15:21 d03 smbd[17096]: [2014/09/30 12:15:21.942313, 0] smbd/process.c:2440(keepalive_fn)
Sep 30 12:15:21 d03 smbd[17096]: send_keepalive failed for client 0.0.0.0. Error Broken pipe - exiting
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Heartbeat restart on node d02
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Link d02:bond0 up.
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Status update for node d02: status init
Sep 30 13:13:16 d03 ipfail: [15789]: info: Link Status update: Link d02/bond0 now has status up
Sep 30 13:13:16 d03 ipfail: [15789]: info: Status update: Node d02 now has status init
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Link d02:bond1 up.
Sep 30 13:13:16 d03 ipfail: [15789]: info: Link Status update: Link d02/bond1 now has status up
Sep 30 13:13:16 d03 heartbeat: [15768]: info: Status update for node d02: status up
Sep 30 13:13:16 d03 ipfail: [15789]: info: Status update: Node d02 now has status up
Sep 30 13:13:16 d03 harc(default)[29476]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:16 d03 harc(default)[29493]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:17 d03 heartbeat: [15768]: info: all clients are now paused
Sep 30 13:13:18 d03 heartbeat: [15768]: info: Status update for node d02: status active
Sep 30 13:13:18 d03 ipfail: [15789]: info: Status update: Node d02 now has status active
Sep 30 13:13:18 d03 harc(default)[29510]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:19 d03 heartbeat: [15768]: info: remote resource transition completed.
Sep 30 13:13:25 d03 heartbeat: [15768]: info: all clients are now resumed
Sep 30 13:13:27 d03 ipfail: [15789]: info: Asking other side for ping node count.
Sep 30 13:13:29 d03 ipfail: [15789]: info: No giveup timer to abort.

d02:/var/log/messages

Sep 30 13:13:12 d02 heartbeat: [18371]: info: Pacemaker support: false
Sep 30 13:13:12 d02 heartbeat: [18371]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Sep 30 13:13:12 d02 heartbeat: [18371]: info: **************************
Sep 30 13:13:12 d02 heartbeat: [18371]: info: Configuration validated.
Starting heartbeat 3.0.4
Sep 30 13:13:12 d02 heartbeat: [18372]: info: heartbeat: version 3.0.4
Sep 30 13:13:12 d02 heartbeat: [18372]: info: Heartbeat generation: 1372996313
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound send socket to device: bond0
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound receive socket to device: bond0
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: started on port 694 interface bond0 to 10.1.205.3
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond1
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound send socket to device: bond1
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound receive socket to device: bond1
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: started on port 694 interface bond1 to 172.1.1.3
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond2
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound send socket to device: bond2
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: bound receive socket to device: bond2
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ucast: started on port 694 interface bond2 to 192.168.128.3
Sep 30 13:13:12 d02 heartbeat: [18372]: info: glib: ping group heartbeat started.
Sep 30 13:13:12 d02 heartbeat: [18372]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 13:13:12 d02 heartbeat: [18372]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 30 13:13:12 d02 heartbeat: [18372]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 30 13:13:12 d02 heartbeat: [18372]: info: Local status now set to: 'up'
Sep 30 13:13:13 d02 heartbeat: [18372]: info: Link mdsha:mdsha up.
Sep 30 13:13:13 d02 heartbeat: [18372]: info: Status update for node mdsha: status ping
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Link d03:bond0 up.
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Status update for node d03: status active
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Link d03:bond1 up.
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Link d03:bond2 up.
Sep 30 13:13:14 d02 harc(default)[18387]: info: Running /etc/ha.d//rc.d/status status
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Comm_now_up(): updating status to active
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Local status now set to: 'active'
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Starting child client "/usr/lib64/heartbeat/dopd" (0,0)
Sep 30 13:13:14 d02 heartbeat: [18372]: info: Starting child client "/usr/lib64/heartbeat/ipfail" (0,0)
Sep 30 13:13:14 d02 heartbeat: [18405]: info: Starting "/usr/lib64/heartbeat/dopd" as uid 0 gid 0 (pid 18405)
Sep 30 13:13:14 d02 heartbeat: [18406]: info: Starting "/usr/lib64/heartbeat/ipfail" as uid 0 gid 0 (pid 18406)
Sep 30 13:13:15 d02 heartbeat: [18372]: info: remote resource transition completed.
Sep 30 13:13:15 d02 heartbeat: [18372]: info: remote resource transition completed.
Sep 30 13:13:15 d02 heartbeat: [18372]: info: Local Resource acquisition completed. (none)
Sep 30 13:13:15 d02 heartbeat: [18372]: info: Initial resource acquisition complete (T_RESOURCES(them))
Sep 30 13:13:25 d02 ipfail: [18406]: info: Ping node count is balanced.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
