Hi,

On Wed, Mar 12, 2008 at 04:44:08PM +0100, Jerome Caffet wrote:
> Hi
>
> I have a 2-node cluster (node1 and node2) with a simple version 1
> configuration. node1 is primary.
>
> ha.cf :
> bcast eth1
> baud 19200
> serial /dev/ttyS0
> debugfile /var/log/heartbeat-debug.log
> logfile /var/log/heartbeat.log
> logfacility local0
> keepalive 2
> deadtime 10
> warntime 6
> initdead 20
> udpport 1001
> node node1 node2
> auto_failback off
>
> haresources :
> node1 Filesystem::/dev/emcpowera::/data::ext3 mysqld
> IPaddr::xxx.xxx.xxx.xxx ldap jboss httpd
>
>
> Last week, node1 broke down and node2 took the service. It worked well.
> When node1 booted, the services stayed on node2 because of "auto_failback
> off" (it is what we want).
>
> But to check node1 once again, I rebooted it and I lost my services on
> node2.
> In fact, node1 sent a message to say it was stopping:
>
> heartbeat[5028]: 2008/03/07_10:47:40 info: Heartbeat shutdown in progress.
> (5028)
> heartbeat[17733]: 2008/03/07_10:47:40 info: Giving up all HA resources.
> ResourceManager[17746]: 2008/03/07_10:47:40 info: Releasing resource group:
> node1 Filesystem::/dev/emcpowera1::/data::ext3 IPaddr::xxx.xxx.xxx.xxx
> mysqld ldap jboss httpd
>
> On node2 :
>
> heartbeat[4975]: 2008/03/07_10:47:41 info: Received shutdown notice from
> 'node2'.
> heartbeat[4975]: 2008/03/07_10:47:41 info: Resources being acquired from
> node2.
> heartbeat[30169]: 2008/03/07_10:47:41 debug: notify_world: setting SIGCHLD
> Handler to SIG_DFL
> heartbeat[30170]: 2008/03/07_10:47:41 info: No local resources
> [/opt/heartbeat/share/heartbeat/ResourceManager listkeys node2] to acquire.
> harc[30169]: 2008/03/07_10:47:41 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[4975]: 2008/03/07_10:47:41 debug: StartNextRemoteRscReq(): child
> count 1
> mach_down[30198]: 2008/03/07_10:47:41 info: Taking over resource
> group Filesystem::/dev/emcpowera1::/data::ext3
> ResourceManager[30224]: 2008/03/07_10:47:41 info: Acquiring resource group:
> node1 Filesystem::/dev/emcpowera1::/data::ext3 IPaddr::xxx.xxx.xxx.xxx
> mysqld ldap jboss httpd
> Filesystem[30252]: 2008/03/07_10:47:42 INFO: Running OK
> IPaddr[30312]: 2008/03/07_10:47:42 INFO: Running OK
>
> But because ldap was already up, the return code was 1 and so heartbeat
> decided to stop all the services on node2.

You'll have to fix ldap. A resource agent must return success on
start of an already started resource.

> So why did node2 decide to start ALREADY running services?
> How to avoid this case?

Don't know, but it is something which may happen under other
circumstances as well. That's why resource agents have to handle
double starts properly. You should really test all your resource
agents thoroughly, under various circumstances. That's the
foundation of a cluster. If they don't behave then all bets are off.

Thanks,

Dejan

> Thanks in advance
>
> Jerome
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
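
To illustrate the point about double starts, here is a minimal sketch of an init-style script whose start and stop actions are idempotent. It is not the actual ldap script from this thread: PIDFILE, DAEMON_CMD and the do_* function names are hypothetical, and the "daemon" is just a sleep process standing in for a real service. The only behavior it is meant to show is that starting an already running resource returns 0, stopping an already stopped one returns 0, and status returns non-zero (3, following the LSB convention) when the resource is not running.

```shell
#!/bin/sh
# Sketch only: idempotent start/stop/status for an init-style resource script.
# PIDFILE and DAEMON_CMD are hypothetical placeholders, not taken from the
# ldap script discussed in this thread.

PIDFILE="${PIDFILE:-/tmp/demo-resource.pid}"
DAEMON_CMD="${DAEMON_CMD:-sleep 300}"

is_running() {
    # Running means: the pidfile exists and the recorded PID is alive.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

do_start() {
    if is_running; then
        # Double start: already up, so report success, not an error.
        return 0
    fi
    # Word splitting of DAEMON_CMD is intentional (command plus arguments).
    $DAEMON_CMD >/dev/null 2>&1 &
    echo $! > "$PIDFILE"
    is_running
}

do_stop() {
    if ! is_running; then
        # Double stop: already down, clean up a stale pidfile and succeed.
        rm -f "$PIDFILE"
        return 0
    fi
    kill "$(cat "$PIDFILE")" 2>/dev/null
    rm -f "$PIDFILE"
    return 0
}

do_status() {
    if is_running; then
        echo "running"
        return 0
    fi
    echo "stopped"
    return 3
}

# Dispatch on the first argument; with no argument nothing is executed.
case "${1:-}" in
    start)  do_start ;;
    stop)   do_stop ;;
    status) do_status ;;
esac
```

A quick manual check along the same lines is to run start twice and stop twice by hand and look at `$?` after each call. Every one of those four calls should return 0. If any of them does not, the script will misbehave in a takeover like the one in the logs above.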
