> From: Dejan Muhamedagic <[EMAIL PROTECTED]>
> Subject: Re: [Linux-HA] Takeover: Application starts twice
>
> On Thu, Nov 29, 2007 at 03:58:33PM +0100, Burkhard Schultheis wrote:
>> We have an old installation of heartbeat running on SuSE 9.0. heartbeat
>> version is 1.2.3.
>>
>> Normal start is OK. But we tested a takeover. We shut down the active
>> node. Then the application was started twice on the second node.
>>
>> In messages I found this:
>>
>> Nov 28 14:22:37 lechz1 ipfail[1456]: debug: Other side is unstable.
>> Nov 28 14:22:39 lechz1 heartbeat[1424]: info: Received shutdown notice
>> from 'lechz2'.
>> Nov 28 14:22:39 lechz1 heartbeat[1424]: info: Resources being acquired
>> from lechz2.
>> Nov 28 14:22:39 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
>> child count 1
>> Nov 28 14:22:39 lechz1 heartbeat[1460]: info: acquire all HA resources
>> (standby).
>> Nov 28 14:22:40 lechz1 heartbeat: info: Acquiring resource group: lechz1
>> 192.168.7.199 telematx.start.communication
>> Nov 28 14:22:40 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
>> child count 2
>> Nov 28 14:22:40 lechz1 heartbeat[1461]: info: Local Resource acquisition
>> completed.
>> Nov 28 14:22:40 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
>> child count 1
>> Nov 28 14:22:40 lechz1 heartbeat: info: Running
>> /etc/ha.d/resource.d/IPaddr 192.168.7.199 start
>> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
>> /etc/ha.d/resource.d/IPaddr 192.168.7.199 start
>> Nov 28 14:22:40 lechz1 heartbeat: info: /home/lzgneu/bin/ifconfig eth0:0
>> 192.168.7.199 netmask 255.255.255.0 broadcast 192.168.7.255
>> Nov 28 14:22:40 lechz1 heartbeat: info: Sending Gratuitous Arp for
>> 192.168.7.199 on eth0:0 [eth0]
>> Nov 28 14:22:40 lechz1 heartbeat: /usr/lib/heartbeat/send_arp -i 1010 -r
>> 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.7.199 eth0
>> 192.168.7.199 auto 192.168.7.199 ffffffffffff
>> Nov 28 14:22:40 lechz1 heartbeat: debug: /etc/ha.d/resource.d/IPaddr
>> 192.168.7.199 start done. RC=0
>> Nov 28 14:22:40 lechz1 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
>> Nov 28 14:22:40 lechz1 heartbeat: info: Running
>> /etc/ha.d/resource.d/telematx.start.communication start
>> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
>> /etc/ha.d/resource.d/telematx.start.communication start
>> Nov 28 14:22:40 lechz1 heartbeat: debug:
>> /etc/ha.d/resource.d/telematx.start.communication start done. RC=0
>> Nov 28 14:22:40 lechz1 heartbeat[1460]: info: all HA resource
>> acquisition completed (standby).
>> Nov 28 14:22:40 lechz1 heartbeat[1424]: info: Standby resource
>> acquisition done [all].
>> Nov 28 14:22:40 lechz1 heartbeat[1659]: debug: notify_world: setting
>> SIGCHLD Handler to SIG_DFL
>> Nov 28 14:22:40 lechz1 heartbeat: info: Running /etc/ha.d/rc.d/status status
>> Nov 28 14:22:40 lechz1 heartbeat: info: /usr/lib/heartbeat/mach_down:
>> nice_failback: foreign resources acquired
>> Nov 28 14:22:40 lechz1 heartbeat[1424]: info: mach_down takeover complete.
>> Nov 28 14:22:40 lechz1 heartbeat: info: mach_down takeover complete for
>> node lechz2.
>> Nov 28 14:22:40 lechz1 heartbeat[1682]: debug: notify_world: setting
>> SIGCHLD Handler to SIG_DFL
>> Nov 28 14:22:40 lechz1 heartbeat: info: Running
>> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
>> Nov 28 14:22:40 lechz1 heartbeat: received ip-request-resp 192.168.7.199
>> OK yes
>> Nov 28 14:22:40 lechz1 heartbeat: info: Acquiring resource group: lechz1
>> 192.168.7.199 telematx.start.communication
>> Nov 28 14:22:40 lechz1 su: (to lzgneu) root on none
>> Nov 28 14:22:40 lechz1 su: pam_unix2: session started for user lzgneu,
>> service su
>> Nov 28 14:22:40 lechz1 heartbeat: info: Running
>> /etc/ha.d/resource.d/telematx.start.communication start
>> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
>> /etc/ha.d/resource.d/telematx.start.communication start
>> Nov 28 14:22:40 lechz1 su: (to lzgneu) root on none
>> Nov 28 14:22:40 lechz1 su: pam_unix2: session started for user lzgneu,
>> service su
>> Nov 28 14:22:40 lechz1 heartbeat: debug:
>> /etc/ha.d/resource.d/telematx.start.communication start done. RC=0
>>
>> As you can see, telematx.start.communication starts twice in the same
>> second. Where should I look for a configuration error?
>
> There is none :)

No! The application was indeed running two times, which is really bad!

> I really don't know why, but there are two
> messages from heartbeat: one at the info level and the other at
> the debug level (take a closer look). At any rate, resource
> agents should be able to handle starts in the running status.

Should be able. :-(

Regards,
Burkhard
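
To illustrate the point about resource agents handling a start while already running, here is a minimal sketch of an idempotent start in Python. It is not the telematx.start.communication script, which is not shown in this thread. The function names, the PID-file and lock-file approach, and the `launch` callback are all assumptions made for illustration. The idea is that a second `start` request, whether it arrives at the same moment or just after the first, finds the resource already running and does nothing.

```python
import fcntl
import os


def is_running(pid_file):
    """Return True if pid_file names a process that is still alive."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no PID file, or unreadable contents
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def start(pid_file, lock_file, launch):
    """Start the resource unless it is already running.

    launch is a hypothetical callable that starts the application
    and returns its PID.
    """
    with open(lock_file, "w") as lock:
        # Serialize start requests so two callers cannot both launch.
        fcntl.flock(lock, fcntl.LOCK_EX)
        if is_running(pid_file):
            return "already running"
        pid = launch()
        with open(pid_file, "w") as f:
            f.write(str(pid))
        return "started"
    # The lock is released when the lock file is closed.
```

A wrapper in /etc/ha.d/resource.d/ could call `start()` for the `start` argument and return success in both cases, so that a repeated request from heartbeat does not launch a second copy.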
