Hi,
On Mon, Dec 22, 2008 at 07:50:20AM -0800, Robinson, Eric wrote:
> We have 2 nodes running heartbeat 2.1.3
>
> Node 1 (hostname 'ha03') is primary for resource name 'ha_mysql'
>
> Node 2 (hostname 'ha04') is primary for resource name 'ha_ftp'
>
> For two days, Node 2 was offline while we upgraded its kernel
> and drbd versions. It's back up and now we're trying to upgrade
> Node 1. When we try to force Node 1 to go standby, it succeeds.
> A few seconds later it fails back.
It? What fails back?
> However, resource 'ha_ftp' did not fail back. Node 2 kept it
> (perhaps because it is primary for that resource?).
I don't understand what's going on here. According to your logs,
ha_ftp is not touched at all.
Thanks,
Dejan
>
> ha.cf from Node 1
> -----------------
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> traditional_compression false
> keepalive 2
> deadtime 30
> warntime 10
> initdead 120
> udpport 696
> baud 19200
> serial /dev/ttyS0
> bcast bond0
> #bcast eth1
> #mcast eth0 225.0.0.1 696 1 0
> auto_failback off
> #watchdog /dev/watchdog
> node ha03.domain-name-censored.local
> node ha04.domain-name-censored.local
> respawn hacluster /usr/lib/heartbeat/ipfail
> ping 192.168.10.100
> debug 1
> apiauth ipfail gid=haclient uid=hacluster
> #apiauth ccm uid=hacluster
> #apiauth ipfail gid=haclient uid=alanr,root
> #apiauth default gid=haclient
>
> ha.cf from Node 2
> -----------------
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> traditional_compression false
> keepalive 2
> deadtime 30
> warntime 10
> initdead 120
> udpport 696
> baud 19200
> serial /dev/ttyS0
> bcast bond0
> #bcast eth1
> #mcast eth0 225.0.0.1 696 1 0
> auto_failback off
> #watchdog /dev/watchdog
> node ha03.domain-name-censored.local
> node ha04.domain-name-censored.local
> respawn hacluster /usr/lib/heartbeat/ipfail
> ping 192.168.10.100
> debug 1
> apiauth ipfail gid=haclient uid=hacluster
> #apiauth ccm uid=hacluster
> #apiauth ipfail gid=haclient uid=alanr,root
> #apiauth default gid=haclient
>
> ha-debug from Node 1
> --------------------
> heartbeat[9733]: 2008/12/22_07:27:26 debug: StartNextRemoteRscReq() - calling
> hook
> heartbeat[9733]: 2008/12/22_07:27:26 debug: notify_world: invoking harc: OLD
> status: active
> heartbeat[9733]: 2008/12/22_07:27:26 debug: Process [hb_takeover] started pid
> 17604
> heartbeat[9733]: 2008/12/22_07:27:26 debug: Starting notify process
> [hb_takeover]
> heartbeat[17604]: 2008/12/22_07:27:26 debug: notify_world: setting SIGCHLD
> Handler to SIG_DFL
> heartbeat[17604]: 2008/12/22_07:27:26 debug: notify_world: Running harc
> hb_takeover
> harc[17604]: 2008/12/22_07:27:26 info: Running /etc/ha.d/rc.d/hb_takeover
> hb_takeover
> hb_standby[17620]: 2008/12/22_07:27:26 Going standby [local].
> heartbeat[9733]: 2008/12/22_07:27:26 debug: Received standby message me from
> ha03.domain-name-censored.local in state 0
> heartbeat[9733]: 2008/12/22_07:27:26 debug: ask_for_resources: other now
> unstable
> heartbeat[9733]: 2008/12/22_07:27:26 info: ha03.domain-name-censored.local
> wants to go standby [local]
> heartbeat[9733]: 2008/12/22_07:27:26 info: i_hold_resources: 1
> heartbeat[9733]: 2008/12/22_07:27:26 info: New standby state: 1
> heartbeat[9733]: 2008/12/22_07:27:26 info: Managed hb_takeover process 17604
> exited with return code 0.
> heartbeat[9733]: 2008/12/22_07:27:26 debug: RscMgmtProc 'hb_takeover' exited
> code 0
> heartbeat[9733]: 2008/12/22_07:27:26 debug: Received standby message other
> from ha04.domain-name-censored.local in state 1
> heartbeat[9733]: 2008/12/22_07:27:26 info: standby:
> ha04.domain-name-censored.local can take our local resources
> heartbeat[9733]: 2008/12/22_07:27:26 debug: go_standby: other is unstable
> heartbeat[9733]: 2008/12/22_07:27:26 debug: Sending hold resources msg: none,
> stable=0 # standby
> heartbeat[9733]: 2008/12/22_07:27:26 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: 0,
> going_standby: 1, standby running(ms): -521182662, resourcestate: 4
> heartbeat[9733]: 2008/12/22_07:27:26 debug: Process [go_standby] started pid
> 17634
> heartbeat[9733]: 2008/12/22_07:27:26 info: New standby state: 1
> heartbeat[17634]: 2008/12/22_07:27:26 info: give up local HA resources
> (standby).
> heartbeat[17634]: 2008/12/22_07:27:26 info: go_standby: who: 1 resource set:
> local
> heartbeat[17634]: 2008/12/22_07:27:26 info: go_standby: (query/action):
> (ourkeys/givegroup)
> ResourceManager[17647]: 2008/12/22_07:27:26 info: Releasing resource group:
> ha03.domain-name-censored.local drbddisk::ha_mysql
> Filesystem::/dev/drbd0::/ha_mysql::ext3 IPaddr2::192.168.10.201/24/bond0
> mysql_001 mysql_002
> ResourceManager[17647]: 2008/12/22_07:27:26 info: Running
> /etc/init.d/mysql_002 stop
> ResourceManager[17647]: 2008/12/22_07:27:26 debug: Starting
> /etc/init.d/mysql_002 stop
> Killing mysqld with pid 17298
> ResourceManager[17647]: 2008/12/22_07:27:27 debug: /etc/init.d/mysql_002
> stop done. RC=0
> ResourceManager[17647]: 2008/12/22_07:27:27 info: Running
> /etc/init.d/mysql_001 stop
> ResourceManager[17647]: 2008/12/22_07:27:27 debug: Starting
> /etc/init.d/mysql_001 stop
> Killing mysqld with pid 17281
> ResourceManager[17647]: 2008/12/22_07:27:28 debug: /etc/init.d/mysql_001
> stop done. RC=0
> ResourceManager[17647]: 2008/12/22_07:27:28 info: Running
> /etc/ha.d/resource.d/IPaddr2 192.168.10.201/24/bond0 stop
> ResourceManager[17647]: 2008/12/22_07:27:28 debug: Starting
> /etc/ha.d/resource.d/IPaddr2 192.168.10.201/24/bond0 stop
> IPaddr2[17782]: 2008/12/22_07:27:28 INFO: ip -f inet addr delete
> 192.168.10.201/24 dev bond0
> IPaddr2[17782]: 2008/12/22_07:27:28 INFO: ip -o -f inet addr show bond0
> IPaddr2[17753]: 2008/12/22_07:27:28 INFO: Success
> INFO: Success
> ResourceManager[17647]: 2008/12/22_07:27:28 debug:
> /etc/ha.d/resource.d/IPaddr2 192.168.10.201/24/bond0 stop done. RC=0
> ResourceManager[17647]: 2008/12/22_07:27:28 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_mysql ext3 stop
> ResourceManager[17647]: 2008/12/22_07:27:28 debug: Starting
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_mysql ext3 stop
> Filesystem[17863]: 2008/12/22_07:27:28 INFO: Running stop for /dev/drbd0
> on /ha_mysql
> Filesystem[17863]: 2008/12/22_07:27:28 INFO: Trying to unmount /ha_mysql
> Filesystem[17863]: 2008/12/22_07:27:28 INFO: unmounted /ha_mysql
> successfully
> Filesystem[17852]: 2008/12/22_07:27:28 INFO: Success
> INFO: Success
> ResourceManager[17647]: 2008/12/22_07:27:28 debug:
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_mysql ext3 stop done. RC=0
> ResourceManager[17647]: 2008/12/22_07:27:28 info: Running
> /etc/ha.d/resource.d/drbddisk ha_mysql stop
> ResourceManager[17647]: 2008/12/22_07:27:28 debug: Starting
> /etc/ha.d/resource.d/drbddisk ha_mysql stop
> ResourceManager[17647]: 2008/12/22_07:27:28 debug:
> /etc/ha.d/resource.d/drbddisk ha_mysql stop done. RC=0
> heartbeat[17634]: 2008/12/22_07:27:28 info: local HA resource release
> completed (standby).
> heartbeat[17634]: 2008/12/22_07:27:28 debug: Sending standby [done] msg
> heartbeat[17634]: 2008/12/22_07:27:28 info: FIFO message [type ask_resources]
> written rc=49
> heartbeat[9733]: 2008/12/22_07:27:28 debug: Received standby message done
> from ha03.domain-name-censored.local in state 1
> heartbeat[9733]: 2008/12/22_07:27:28 info: Local standby process completed
> [local].
> heartbeat[9733]: 2008/12/22_07:27:28 info: New standby state: 3
> heartbeat[9733]: 2008/12/22_07:27:28 info: Managed go_standby process 17634
> exited with return code 0.
> heartbeat[9733]: 2008/12/22_07:27:28 debug: RscMgmtProc 'go_standby' exited
> code 0
> heartbeat[9733]: 2008/12/22_07:27:50 WARN: 1 lost packet(s) for
> [ha04.domain-name-censored.local] [3100:3102]
> heartbeat[9733]: 2008/12/22_07:27:50 info: remote resource transition
> completed.
> heartbeat[9733]: 2008/12/22_07:27:50 debug: Sending hold resources msg: none,
> stable=1 # <none>
> heartbeat[9733]: 2008/12/22_07:27:50 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 3, standby running(ms): -521180402, resourcestate: 4
> heartbeat[9733]: 2008/12/22_07:27:50 debug: Calling PerformAutoFailback()
> heartbeat[9733]: 2008/12/22_07:27:50 info: other_holds_resources: 3
> heartbeat[9733]: 2008/12/22_07:27:50 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 3, standby running(ms): -521180402, resourcestate: 4
> ipfail[10103]: 2008/12/22_07:27:50 debug: Other side is now stable.
> heartbeat[9733]: 2008/12/22_07:27:50 info: No pkts missing from
> ha04.domain-name-censored.local!
> heartbeat[9733]: 2008/12/22_07:27:50 debug: Received standby message done
> from ha04.domain-name-censored.local in state 3
> heartbeat[9733]: 2008/12/22_07:27:50 info: Other node completed standby
> takeover of local resources.
> heartbeat[9733]: 2008/12/22_07:27:50 debug: Sending hold resources msg: none,
> stable=1 # <none>
> heartbeat[9733]: 2008/12/22_07:27:50 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> heartbeat[9733]: 2008/12/22_07:27:50 info: New standby state: 0
> heartbeat[9733]: 2008/12/22_07:27:51 info: other_holds_resources: 3
> heartbeat[9733]: 2008/12/22_07:27:51 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> ipfail[10103]: 2008/12/22_07:27:51 debug: Other side is now stable.
> heartbeat[9733]: 2008/12/22_07:28:21 debug: Received standby message me from
> ha04.domain-name-censored.local in state 0
> heartbeat[9733]: 2008/12/22_07:28:21 debug: ask_for_resources: other now
> unstable
> heartbeat[9733]: 2008/12/22_07:28:21 info: ha04.domain-name-censored.local
> wants to go standby [foreign]
> heartbeat[9733]: 2008/12/22_07:28:21 info: standby: other_holds_resources: 3
> heartbeat[9733]: 2008/12/22_07:28:21 debug: Sending standby [other] msg
> heartbeat[9733]: 2008/12/22_07:28:21 debug: Received standby message other
> from ha03.domain-name-censored.local in state 2
> heartbeat[9733]: 2008/12/22_07:28:21 info: New standby state: 2
> heartbeat[9733]: 2008/12/22_07:28:21 info: New standby state: 2
> heartbeat[9733]: 2008/12/22_07:28:21 debug: process_resources(2): other now
> unstable
> heartbeat[9733]: 2008/12/22_07:28:21 info: other_holds_resources: 1
> heartbeat[9733]: 2008/12/22_07:28:21 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: 0,
> going_standby: 2, standby running(ms): -521128082, resourcestate: 4
> ipfail[10103]: 2008/12/22_07:28:21 debug: Other side is unstable.
> heartbeat[9733]: 2008/12/22_07:28:42 debug: Received standby message done
> from ha04.domain-name-censored.local in state 2
> heartbeat[9733]: 2008/12/22_07:28:42 info: standby: acquire [foreign]
> resources from ha04.domain-name-censored.local
> heartbeat[9733]: 2008/12/22_07:28:42 debug: Process [go_standby] started pid
> 18012
> heartbeat[9733]: 2008/12/22_07:28:42 info: New standby state: 3
> heartbeat[18012]: 2008/12/22_07:28:42 info: acquire local HA resources
> (standby).
> heartbeat[18012]: 2008/12/22_07:28:42 info: go_standby: who: 2 resource set:
> local
> heartbeat[18012]: 2008/12/22_07:28:42 info: go_standby: (query/action):
> (ourkeys/takegroup)
> ResourceManager[18025]: 2008/12/22_07:28:42 info: Acquiring resource group:
> ha03.domain-name-censored.local drbddisk::ha_mysql
> Filesystem::/dev/drbd0::/ha_mysql::ext3 IPaddr2::192.168.10.201/24/bond0
> mysql_001 mysql_002
> ResourceManager[18025]: 2008/12/22_07:28:42 info: Running
> /etc/ha.d/resource.d/drbddisk ha_mysql start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: Starting
> /etc/ha.d/resource.d/drbddisk ha_mysql start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug:
> /etc/ha.d/resource.d/drbddisk ha_mysql start done. RC=0
> Filesystem[18093]: 2008/12/22_07:28:42 INFO: Resource is stopped
> ResourceManager[18025]: 2008/12/22_07:28:42 info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_mysql ext3 start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: Starting
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_mysql ext3 start
> Filesystem[18174]: 2008/12/22_07:28:42 INFO: Running start for
> /dev/drbd0 on /ha_mysql
> Filesystem[18163]: 2008/12/22_07:28:42 INFO: Success
> INFO: Success
> ResourceManager[18025]: 2008/12/22_07:28:42 debug:
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_mysql ext3 start done. RC=0
> IPaddr2[18248]: 2008/12/22_07:28:42 INFO: Resource is stopped
> ResourceManager[18025]: 2008/12/22_07:28:42 info: Running
> /etc/ha.d/resource.d/IPaddr2 192.168.10.201/24/bond0 start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: Starting
> /etc/ha.d/resource.d/IPaddr2 192.168.10.201/24/bond0 start
> IPaddr2[18360]: 2008/12/22_07:28:42 INFO: ip -f inet addr add
> 192.168.10.201/24 brd 192.168.10.255 dev bond0
> IPaddr2[18360]: 2008/12/22_07:28:42 INFO: ip link set bond0 up
> IPaddr2[18360]: 2008/12/22_07:28:42 INFO: /usr/lib/heartbeat/send_arp -i 200
> -r 5 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.10.201 bond0
> 192.168.10.201 auto not_used not_used
> IPaddr2[18331]: 2008/12/22_07:28:42 INFO: Success
> INFO: Success
> ResourceManager[18025]: 2008/12/22_07:28:42 debug:
> /etc/ha.d/resource.d/IPaddr2 192.168.10.201/24/bond0 start done. RC=0
> ResourceManager[18025]: 2008/12/22_07:28:42 info: Running
> /etc/init.d/mysql_001 start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: Starting
> /etc/init.d/mysql_001 start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: /etc/init.d/mysql_001
> start done. RC=0
> ResourceManager[18025]: 2008/12/22_07:28:42 info: Running
> /etc/init.d/mysql_002 start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: Starting
> /etc/init.d/mysql_002 start
> ResourceManager[18025]: 2008/12/22_07:28:42 debug: /etc/init.d/mysql_002
> start done. RC=0
> heartbeat[18012]: 2008/12/22_07:28:42 info: local HA resource acquisition
> completed (standby).
> heartbeat[18012]: 2008/12/22_07:28:42 debug: Sending standby [done] msg
> heartbeat[18012]: 2008/12/22_07:28:42 info: FIFO message [type ask_resources]
> written rc=51
> heartbeat[9733]: 2008/12/22_07:28:42 debug: Received standby message done
> from ha03.domain-name-censored.local in state 3
> heartbeat[9733]: 2008/12/22_07:28:42 info: Standby resource acquisition done
> [foreign].
> heartbeat[9733]: 2008/12/22_07:28:42 debug: Sending hold resources msg:
> local, stable=1 # <none>
> heartbeat[9733]: 2008/12/22_07:28:42 info: AnnounceTakeover(local 1, foreign
> 1, reason 'T_RESOURCES(us)' (1))
> heartbeat[9733]: 2008/12/22_07:28:42 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> heartbeat[9733]: 2008/12/22_07:28:42 info: New standby state: 0
> heartbeat[9733]: 2008/12/22_07:28:42 info: Managed go_standby process 18012
> exited with return code 0.
> heartbeat[9733]: 2008/12/22_07:28:42 debug: RscMgmtProc 'go_standby' exited
> code 0
> heartbeat[9733]: 2008/12/22_07:28:43 info: remote resource transition
> completed.
> heartbeat[9733]: 2008/12/22_07:28:43 debug: Sending hold resources msg:
> local, stable=1 # <none>
> heartbeat[9733]: 2008/12/22_07:28:43 info: AnnounceTakeover(local 1, foreign
> 1, reason 'T_RESOURCES(us)' (1))
> heartbeat[9733]: 2008/12/22_07:28:43 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> heartbeat[9733]: 2008/12/22_07:28:43 debug: Calling PerformAutoFailback()
> heartbeat[9733]: 2008/12/22_07:28:43 info: other_holds_resources: 1
> heartbeat[9733]: 2008/12/22_07:28:43 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> ipfail[10103]: 2008/12/22_07:28:43 debug: Other side is now stable.
> heartbeat[9733]: 2008/12/22_07:28:43 info: other_holds_resources: 1
> heartbeat[9733]: 2008/12/22_07:28:43 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> ipfail[10103]: 2008/12/22_07:28:43 debug: Other side is now stable.
> heartbeat[9733]: 2008/12/22_07:29:23 debug: APIregistration_dispatch() {
> heartbeat[9733]: 2008/12/22_07:29:23 debug: process_registerevent() {
> heartbeat[9733]: 2008/12/22_07:29:23 debug: client->gsource = 0x8bcdb40
> heartbeat[9733]: 2008/12/22_07:29:23 debug: }/*process_registerevent*/;
> heartbeat[9733]: 2008/12/22_07:29:23 debug: }/*APIregistration_dispatch*/;
> heartbeat[9733]: 2008/12/22_07:29:23 debug: Checking client authorization for
> client 18641 (0:496)
> heartbeat[9733]: 2008/12/22_07:29:23 debug: create_seq_snapshot_table:no
> missing packets found for node ha03.domain-name-censored.local
> heartbeat[9733]: 2008/12/22_07:29:23 debug: create_seq_snapshot_table:no
> missing packets found for node ha04.domain-name-censored.local
> heartbeat[9733]: 2008/12/22_07:29:23 debug: Signing on API client 18641
> ('casual')
> heartbeat[9733]: 2008/12/22_07:29:23 debug: hb_rsc_isstable:
> ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0,
> going_standby: 0, standby running(ms): 0, resourcestate: 4
> heartbeat[9733]: 2008/12/22_07:29:23 debug: Signing client 18641 off
> heartbeat[9733]: 2008/12/22_07:29:23 debug: G_remove_client(pid=18641,
> reason='signoff' gsource=0x8bcdb40) {
> heartbeat[9733]: 2008/12/22_07:29:23 debug: api_remove_client_int: removing
> pid [18641] reason: signoff
> heartbeat[9733]: 2008/12/22_07:29:23 debug: }/*G_remove_client;*/
>
> --Eric
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems