Hello All,
        I would appreciate your help with a problem I am facing with
Apache HA using heartbeat (HB) and MON.

        I have been working on setting up a two-node failover cluster for my
web service. I have installed heartbeat 2.0.5 and MON on the two SUSE
Linux servers. MON is monitoring the Apache web server. I tested two
methods of causing a failover and then a failback. With Method 1, I end
up with a split brain in the cluster.


Method 1:

SLAVENODE takes over all the resources if I stop heartbeat on
MASTERNODE by running 'rcheartbeat stop', which is expected.
But if I then run 'rcheartbeat start' on MASTERNODE to restart
heartbeat, MASTERNODE thinks SLAVENODE is dead and takes over
the resources, ending up in an unrecoverable split brain.

Method 2:
Surprisingly, if I cause the failover by pulling the network cable,
then plug the cable back in and start heartbeat again on MASTERNODE,
MASTERNODE detects SLAVENODE, SLAVENODE relinquishes the resources
to MASTERNODE, and everything seems fine.

I cannot work out why Method 1 ends up in a split brain while
Method 2 does not.

My ha.cf and haresources are as below.

debug 1
logfile /var/log/ha-log
keepalive 2
warntime 30
deadtime 80
initdead 90
node MASTERNODE
node SLAVENODE
bcast eth0
udpport 694
auto_failback on
ping_group ping-cluster-test 10.10.10.1 10.10.10.151
respawn hacluster /usr/lib/heartbeat/ipfail
crm off
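
To make the timers above easier to compare against the log timestamps, here is a minimal Python sketch that reads ha.cf-style directives into a dictionary. It is not part of heartbeat: the names `parse_ha_cf` and `timing_summary` are hypothetical helpers, the embedded text is a subset of the file above, and the sketch assumes the timer values are plain whole seconds, as they are here.

```python
def parse_ha_cf(text):
    """Collect ha.cf-style directives into a dict: name -> list of argument strings.

    Hypothetical helper for illustration; it only splits lines and does not
    validate directive names the way heartbeat itself would.
    """
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        name = parts[0]
        args = parts[1] if len(parts) > 1 else ""
        directives.setdefault(name, []).append(args)
    return directives


def timing_summary(directives):
    """Return the timer directives as integers (assumes plain seconds)."""
    keys = ("keepalive", "warntime", "deadtime", "initdead")
    return {k: int(directives[k][0]) for k in keys if k in directives}


# Subset of the ha.cf shown above.
HA_CF = """\
keepalive 2
warntime 30
deadtime 80
initdead 90
node MASTERNODE
node SLAVENODE
auto_failback on
"""

conf = parse_ha_cf(HA_CF)
timers = timing_summary(conf)
print(timers)
print("nodes:", conf["node"])
```

Repeated directives such as `node` are kept as a list, so both node names survive parsing.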

Also attached are the master and slave log dumps from when the split
brain occurs in Method 1.
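
For reading the dumps, a small Python sketch like the following can measure the time between two events in a heartbeat log. The function names are hypothetical, the three sample lines are copied from the master dump below with their mail-wrapped halves rejoined, and the regular expression assumes the `YYYY/MM/DD_HH:MM:SS level: message` layout seen in these logs.

```python
import re
from datetime import datetime

# Timestamp, log level and message as they appear in the dumps below.
STAMP = re.compile(r"(\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+(\w+):\s+(.*)")


def parse_line(line):
    """Return (datetime, level, message), or None if the line has no such stamp."""
    m = STAMP.search(line)
    if m is None:
        return None
    when = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
    return when, m.group(2), m.group(3)


def seconds_between(lines, first_text, second_text):
    """Seconds from the first message containing first_text to the next
    later message containing second_text; None if either is missing."""
    first = second = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, _level, message = parsed
        if first is None and first_text in message:
            first = when
        elif first is not None and second_text in message:
            second = when
            break
    if first is None or second is None:
        return None
    return (second - first).total_seconds()


SAMPLE = [
    "heartbeat[15673]: 2007/08/02_11:08:06 info: Local status now set to: 'up'",
    "heartbeat[15673]: 2007/08/02_11:09:37 WARN: node SLAVENODE: is dead",
    "heartbeat[15673]: 2007/08/02_11:09:39 info: Link SLAVENODE:eth0 up.",
]

print(seconds_between(SAMPLE, "Local status now set to: 'up'", "is dead"))  # 91.0
print(seconds_between(SAMPLE, "is dead", "Link SLAVENODE:eth0 up"))         # 2.0
```

On these lines the master declares SLAVENODE dead 91 seconds after coming up, which is close to the initdead value of 90, and sees the SLAVENODE link 2 seconds after that.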

It would be great to get your suggestions on how to solve this.


Regards
Shailesh P Shirali

heartbeat[15672]: 2007/08/02_11:08:06 info: respawn directive: hacluster 
/usr/lib/heartbeat/ipfail
heartbeat[15672]: 2007/08/02_11:08:06 info: AUTH: i=1: key = 0x80f1b28, 
auth=0xb7b13f00, authname=crc
heartbeat[15672]: 2007/08/02_11:08:06 WARN: Core dumps could be lost if 
multiple dumps occur
heartbeat[15672]: 2007/08/02_11:08:06 WARN: Consider setting 
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
heartbeat[15672]: 2007/08/02_11:08:06 WARN: Logging daemon is disabled 
--enabling logging daemon is recommended
heartbeat[15672]: 2007/08/02_11:08:06 info: **************************
heartbeat[15672]: 2007/08/02_11:08:06 info: Configuration validated. Starting 
heartbeat 2.0.5
heartbeat[15673]: 2007/08/02_11:08:06 info: heartbeat: version 2.0.5
heartbeat[15673]: 2007/08/02_11:08:06 info: Heartbeat generation: 165
heartbeat[15673]: 2007/08/02_11:08:06 info: G_main_add_TriggerHandler: Added 
signal manual handler
heartbeat[15673]: 2007/08/02_11:08:06 info: G_main_add_TriggerHandler: Added 
signal manual handler
heartbeat[15673]: 2007/08/02_11:08:06 info: Removing /var/run/heartbeat/rsctmp 
failed, recreating.
heartbeat[15673]: 2007/08/02_11:08:06 info: glib: UDP Broadcast heartbeat 
started on port 694 (694)  interface eth0
heartbeat[15673]: 2007/08/02_11:08:06 info: glib: UDP Broadcast heartbeat 
closed on port 694 interface eth0 - Status: 1
heartbeat[15673]: 2007/08/02_11:08:06 info: glib: ping group heartbeat started.
heartbeat[15673]: 2007/08/02_11:08:06 info: G_main_add_SignalHandler: Added 
signal handler for signal 17
heartbeat[15673]: 2007/08/02_11:08:06 info: Local status now set to: 'up'
heartbeat[15673]: 2007/08/02_11:08:07 info: Link 
ping-cluster-test:ping-cluster-test up.
heartbeat[15673]: 2007/08/02_11:08:07 info: Status update for node 
ping-cluster-test: status ping
heartbeat[15673]: 2007/08/02_11:08:07 info: Link MASTERNODE:eth0 up.
heartbeat[15673]: 2007/08/02_11:09:37 WARN: node SLAVENODE: is dead
heartbeat[15673]: 2007/08/02_11:09:37 info: Comm_now_up(): updating status to 
active
heartbeat[15673]: 2007/08/02_11:09:37 info: Local status now set to: 'active'
heartbeat[15673]: 2007/08/02_11:09:37 info: Starting child client 
"/usr/lib/heartbeat/ipfail" (90,90)
heartbeat[15673]: 2007/08/02_11:09:37 WARN: No STONITH device configured.
heartbeat[15673]: 2007/08/02_11:09:37 WARN: Shared disks are not protected.
heartbeat[15673]: 2007/08/02_11:09:37 info: Resources being acquired from 
SLAVENODE.
heartbeat[15685]: 2007/08/02_11:09:37 info: Starting 
"/usr/lib/heartbeat/ipfail" as uid 90  gid 90 (pid 15685)
harc[15686]:    2007/08/02_11:09:37 info: Running /etc/ha.d/rc.d/status status
mach_down[15699]:       2007/08/02_11:09:37 info: /usr/lib/heartbeat/mach_down: 
nice_failback: foreign resources acquired
mach_down[15699]:       2007/08/02_11:09:37 info: mach_down takeover complete 
for node SLAVENODE.
heartbeat[15673]: 2007/08/02_11:09:37 info: AnnounceTakeover(local 1, foreign 
0, reason 'T_RESOURCES(us)' (0))
heartbeat[15673]: 2007/08/02_11:09:37 info: mach_down takeover complete.
heartbeat[15673]: 2007/08/02_11:09:37 info: AnnounceTakeover(local 1, foreign 
1, reason 'mach_down' (0))
heartbeat[15673]: 2007/08/02_11:09:37 info: Initial resource acquisition 
complete (mach_down)
heartbeat[15673]: 2007/08/02_11:09:37 info: STATE 1 => 3
heartbeat[15673]: 2007/08/02_11:09:37 info: Exiting status process 15686 
returned rc 0.
IPaddr[15747]:  2007/08/02_11:09:37 INFO: IPaddr Resource is stopped
req_resource[15714]:    2007/08/02_11:09:37 debug: in 
/usr/lib/heartbeat/req_resource 10.10.10.157
req_resource[15714]:    2007/08/02_11:09:37 debug: dont_ask: yes nice_failback: 
yes
heartbeat[15687]: 2007/08/02_11:09:37 info: 1 local resources from 
[/usr/lib/heartbeat/ResourceManager listkeys MASTERNODE]
heartbeat[15687]: 2007/08/02_11:09:37 info: Local Resource acquisition 
completed.
heartbeat[15687]: 2007/08/02_11:09:37 info: FIFO message [type resource] 
written rc=79
heartbeat[15673]: 2007/08/02_11:09:37 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[15673]: 2007/08/02_11:09:37 info: Exiting req_our_resources process 
15687 returned rc 0.
heartbeat[15673]: 2007/08/02_11:09:37 info: AnnounceTakeover(local 1, foreign 
1, reason 'req_our_resources' (1))
harc[15853]:    2007/08/02_11:09:37 info: Running 
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[15853]: 2007/08/02_11:09:37 received ip-request-resp 
10.10.10.157 OK yes
ResourceManager[15866]: 2007/08/02_11:09:37 info: Acquiring resource group: 
MASTERNODE 10.10.10.157 apache2 mon
IPaddr[15889]:  2007/08/02_11:09:37 INFO: IPaddr Resource is stopped
ResourceManager[15866]: 2007/08/02_11:09:37 info: Running 
/etc/ha.d/resource.d/IPaddr 10.10.10.157 start
IPaddr[16075]:  2007/08/02_11:09:37 INFO: /sbin/ifconfig eth0:0 10.10.10.157 
netmask 255.255.255.0 broadcast 10.10.10.255
IPaddr[16075]:  2007/08/02_11:09:37 INFO: Sending Gratuitous Arp for 
10.10.10.157 on eth0:0 [eth0]
IPaddr[16075]:  2007/08/02_11:09:37 INFO: /usr/lib/heartbeat/send_arp -i 500 -r 
10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.10.10.157 eth0 
10.10.10.157 auto 10.10.10.157 ffffffffffff
IPaddr[16005]:  2007/08/02_11:09:37 INFO: IPaddr Success
ResourceManager[15866]: 2007/08/02_11:09:37 info: Running 
/etc/ha.d/resource.d/apache2  start
heartbeat[15673]: 2007/08/02_11:09:39 info: Link SLAVENODE:eth0 up.
heartbeat[15673]: 2007/08/02_11:09:39 info: Status update for node SLAVENODE: 
status active
ResourceManager[15866]: 2007/08/02_11:09:43 info: Running 
/etc/ha.d/resource.d/mon  start
heartbeat[15673]: 2007/08/02_11:09:43 info: Exiting ip-request-resp process 
15853 returned rc 0.
heartbeat[15673]: 2007/08/02_11:09:43 info: AnnounceTakeover(local 1, foreign 
1, reason 'ip-request-resp' (1))
harc[16254]:    2007/08/02_11:09:43 info: Running /etc/ha.d/rc.d/status status
heartbeat[15673]: 2007/08/02_11:09:43 info: Exiting status process 16254 
returned rc 0.
heartbeat[15673]: 2007/08/02_11:09:47 info: Local Resource acquisition 
completed. (none)
heartbeat[15673]: 2007/08/02_11:09:47 info: local resource transition completed.
heartbeat[15673]: 2007/08/02_11:09:47 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[15673]: 2007/08/02_11:09:58 info: MASTERNODE wants to go standby 
[foreign]
heartbeat[15673]: 2007/08/02_11:09:58 info: i_hold_resources: 3
heartbeat[15673]: 2007/08/02_11:09:58 info: New standby state: 1
heartbeat[15673]: 2007/08/02_11:09:58 info: standby: SLAVENODE can take our 
foreign resources
heartbeat[15673]: 2007/08/02_11:09:58 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[15673]: 2007/08/02_11:09:58 info: New standby state: 1
heartbeat[16265]: 2007/08/02_11:09:58 info: give up foreign HA resources 
(standby).
heartbeat[16265]: 2007/08/02_11:09:58 info: go_standby: who: 1 resource set: 
foreign
heartbeat[16265]: 2007/08/02_11:09:58 info: go_standby: (query/action): 
(otherkeys/givegroup)
heartbeat[16265]: 2007/08/02_11:09:58 info: foreign HA resource release 
completed (standby).
heartbeat[16265]: 2007/08/02_11:09:58 info: FIFO message [type ask_resources] 
written rc=51
heartbeat[15673]: 2007/08/02_11:09:58 info: Local standby process completed 
[foreign].
heartbeat[15673]: 2007/08/02_11:09:58 info: New standby state: 3
heartbeat[15673]: 2007/08/02_11:09:58 info: Exiting go_standby process 16265 
returned rc 0.
heartbeat[15673]: 2007/08/02_11:09:59 WARN: 1 lost packet(s) for [SLAVENODE] 
[374:376]
heartbeat[15673]: 2007/08/02_11:09:59 info: remote resource transition 
completed.
heartbeat[15673]: 2007/08/02_11:09:59 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[15673]: 2007/08/02_11:09:59 ERROR: Both machines own our resources!
heartbeat[15673]: 2007/08/02_11:09:59 info: other_holds_resources: 3
heartbeat[15673]: 2007/08/02_11:09:59 ERROR: Both machines own our resources!
heartbeat[15673]: 2007/08/02_11:09:59 info: No pkts missing from SLAVENODE!
heartbeat[15673]: 2007/08/02_11:09:59 info: Other node completed standby 
takeover of foreign resources.
heartbeat[15673]: 2007/08/02_11:09:59 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[15673]: 2007/08/02_11:09:59 ERROR: Both machines own our resources!
heartbeat[15673]: 2007/08/02_11:09:59 info: New standby state: 0
heartbeat[15673]: 2007/08/02_11:09:59 info: other_holds_resources: 3
heartbeat[15673]: 2007/08/02_11:09:59 ERROR: Both machines own our resources!
heartbeat[22111]: 2007/07/31_11:06:26 info: Heartbeat restart on node MASTERNODE
heartbeat[22111]: 2007/07/31_11:06:26 info: Link MASTERNODE:eth0 up.
heartbeat[22111]: 2007/07/31_11:06:26 info: Status update for node MASTERNODE: 
status init
heartbeat[22111]: 2007/07/31_11:06:26 info: Status update for node MASTERNODE: 
status up
harc[22855]:    2007/07/31_11:06:26 info: Running /etc/ha.d/rc.d/status status
heartbeat[22111]: 2007/07/31_11:06:26 info: Exiting status process 22855 
returned rc 0.
harc[22864]:    2007/07/31_11:06:26 info: Running /etc/ha.d/rc.d/status status
heartbeat[22111]: 2007/07/31_11:06:26 info: Exiting status process 22864 
returned rc 0.
heartbeat[22111]: 2007/07/31_11:06:37 info: all clients are now paused
heartbeat[22111]: 2007/07/31_11:07:56 WARN: 1 lost packet(s) for [MASTERNODE] 
[49:51]
heartbeat[22111]: 2007/07/31_11:07:56 info: Status update for node MASTERNODE: 
status active
heartbeat[22111]: 2007/07/31_11:07:56 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[22111]: 2007/07/31_11:07:56 info: No pkts missing from MASTERNODE!
heartbeat[22111]: 2007/07/31_11:07:56 info: other_holds_resources: 2
heartbeat[22111]: 2007/07/31_11:07:56 ERROR: Both machines own our resources!
heartbeat[22111]: 2007/07/31_11:07:56 info: remote resource transition 
completed.
heartbeat[22111]: 2007/07/31_11:07:56 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[22111]: 2007/07/31_11:07:56 ERROR: Both machines own our resources!
heartbeat[22111]: 2007/07/31_11:07:56 ERROR: Both machines own foreign 
resources!
heartbeat[22111]: 2007/07/31_11:07:56 info: SLAVENODE wants to go standby 
[foreign]
heartbeat[22111]: 2007/07/31_11:07:56 info: i_hold_resources: 3
heartbeat[22111]: 2007/07/31_11:07:56 info: New standby state: 1
heartbeat[22111]: 2007/07/31_11:07:56 info: other_holds_resources: 3
heartbeat[22111]: 2007/07/31_11:07:56 ERROR: Both machines own our resources!
heartbeat[22111]: 2007/07/31_11:07:56 ERROR: Both machines own foreign 
resources!
harc[22884]:    2007/07/31_11:07:56 info: Running /etc/ha.d/rc.d/status status
heartbeat[22111]: 2007/07/31_11:07:56 info: Exiting status process 22884 
returned rc 0.
heartbeat[22111]: 2007/07/31_11:08:06 WARN: No reply to standby request.  
Standby request cancelled.
heartbeat[22111]: 2007/07/31_11:08:06 info: other_holds_resources: 3
heartbeat[22111]: 2007/07/31_11:08:06 ERROR: Both machines own our resources!
heartbeat[22111]: 2007/07/31_11:08:06 ERROR: Both machines own foreign 
resources!
heartbeat[22111]: 2007/07/31_11:08:07 ERROR: Message hist queue is filling up 
(151 messages in queue)
heartbeat[22111]: 2007/07/31_11:08:08 info: all clients are now resumed
heartbeat[22111]: 2007/07/31_11:08:17 info: MASTERNODE wants to go standby 
[foreign]
heartbeat[22111]: 2007/07/31_11:08:17 info: standby: other_holds_resources: 3
heartbeat[22111]: 2007/07/31_11:08:17 info: New standby state: 2
heartbeat[22111]: 2007/07/31_11:08:17 info: New standby state: 2
heartbeat[22111]: 2007/07/31_11:08:17 info: other_holds_resources: 1
heartbeat[22111]: 2007/07/31_11:08:17 ERROR: Both machines own foreign 
resources!
heartbeat[22111]: 2007/07/31_11:08:17 info: standby: acquire [foreign] 
resources from MASTERNODE
heartbeat[22111]: 2007/07/31_11:08:17 info: New standby state: 3
heartbeat[22899]: 2007/07/31_11:08:17 info: acquire local HA resources 
(standby).
heartbeat[22899]: 2007/07/31_11:08:17 info: go_standby: who: 2 resource set: 
local
heartbeat[22899]: 2007/07/31_11:08:17 info: go_standby: (query/action): 
(ourkeys/takegroup)
heartbeat[22899]: 2007/07/31_11:08:17 info: local HA resource acquisition 
completed (standby).
heartbeat[22899]: 2007/07/31_11:08:17 info: FIFO message [type ask_resources] 
written rc=51
heartbeat[22111]: 2007/07/31_11:08:17 info: Standby resource acquisition done 
[foreign].
heartbeat[22111]: 2007/07/31_11:08:17 info: AnnounceTakeover(local 1, foreign 
1, reason 'auto_failback' (1))
heartbeat[22111]: 2007/07/31_11:08:17 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[22111]: 2007/07/31_11:08:17 ERROR: Both machines own foreign 
resources!
heartbeat[22111]: 2007/07/31_11:08:17 info: New standby state: 0
heartbeat[22111]: 2007/07/31_11:08:17 info: Exiting go_standby process 22899 
returned rc 0.
heartbeat[22111]: 2007/07/31_11:08:18 info: remote resource transition 
completed.
heartbeat[22111]: 2007/07/31_11:08:18 info: AnnounceTakeover(local 1, foreign 
1, reason 'T_RESOURCES(us)' (1))
heartbeat[22111]: 2007/07/31_11:08:18 ERROR: Both machines own foreign 
resources!
heartbeat[22111]: 2007/07/31_11:08:18 info: other_holds_resources: 1
heartbeat[22111]: 2007/07/31_11:08:18 ERROR: Both machines own foreign 
resources!
heartbeat[22111]: 2007/07/31_11:08:18 info: other_holds_resources: 1
heartbeat[22111]: 2007/07/31_11:08:18 ERROR: Both machines own foreign 
resources!