Hi,

On Fri, Nov 30, 2007 at 09:08:50AM +0100, Burkhard Schultheis wrote:
> > From: Dejan Muhamedagic <[EMAIL PROTECTED]>
> > Subject: Re: [Linux-HA] Takeover: Application starts twice
> >
> > On Thu, Nov 29, 2007 at 03:58:33PM +0100, Burkhard Schultheis wrote:
> >> We have an old installation of heartbeat running on SuSE 9.0. heartbeat
> >> version is 1.2.3.
> >>
> >> Normal start is OK. But we tested a takeover. We shut down the active
> >> node. Then the application was started twice on the second node.
> >>
> >> In messages I found this:
> >>
> >> Nov 28 14:22:37 lechz1 ipfail[1456]: debug: Other side is unstable.
> >> Nov 28 14:22:39 lechz1 heartbeat[1424]: info: Received shutdown notice
> >> from 'lechz2'.
> >> Nov 28 14:22:39 lechz1 heartbeat[1424]: info: Resources being acquired
> >> from lechz2.
> >> Nov 28 14:22:39 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
> >> child count 1
> >> Nov 28 14:22:39 lechz1 heartbeat[1460]: info: acquire all HA resources
> >> (standby).
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Acquiring resource group: lechz1
> >> 192.168.7.199 telematx.start.communication
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
> >> child count 2
> >> Nov 28 14:22:40 lechz1 heartbeat[1461]: info: Local Resource acquisition
> >> completed.
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
> >> child count 1
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/resource.d/IPaddr 192.168.7.199 start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
> >> /etc/ha.d/resource.d/IPaddr 192.168.7.199 start
> >> Nov 28 14:22:40 lechz1 heartbeat: info: /home/lzgneu/bin/ifconfig eth0:0
> >> 192.168.7.199 netmask 255.255.255.0 broadcast 192.168.7.255
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Sending Gratuitous Arp for
> >> 192.168.7.199 on eth0:0 [eth0]
> >> Nov 28 14:22:40 lechz1 heartbeat: /usr/lib/heartbeat/send_arp -i 1010 -r
> >> 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.7.199 eth0
> >> 192.168.7.199 auto 192.168.7.199 ffffffffffff
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: /etc/ha.d/resource.d/IPaddr
> >> 192.168.7.199 start done. RC=0
> >> Nov 28 14:22:40 lechz1 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/resource.d/telematx.start.communication start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
> >> /etc/ha.d/resource.d/telematx.start.communication start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug:
> >> /etc/ha.d/resource.d/telematx.start.communication start done. RC=0
> >> Nov 28 14:22:40 lechz1 heartbeat[1460]: info: all HA resource
> >> acquisition completed (standby).
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: info: Standby resource
> >> acquisition done [all].
> >> Nov 28 14:22:40 lechz1 heartbeat[1659]: debug: notify_world: setting
> >> SIGCHLD Handler to SIG_DFL
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running /etc/ha.d/rc.d/status
> >> status
> >> Nov 28 14:22:40 lechz1 heartbeat: info: /usr/lib/heartbeat/mach_down:
> >> nice_failback: foreign resources acquired
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: info: mach_down takeover complete.
> >> Nov 28 14:22:40 lechz1 heartbeat: info: mach_down takeover complete for
> >> node lechz2.
> >> Nov 28 14:22:40 lechz1 heartbeat[1682]: debug: notify_world: setting
> >> SIGCHLD Handler to SIG_DFL
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> >> Nov 28 14:22:40 lechz1 heartbeat: received ip-request-resp 192.168.7.199
> >> OK yes
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Acquiring resource group: lechz1
> >> 192.168.7.199 telematx.start.communication
> >> Nov 28 14:22:40 lechz1 su: (to lzgneu) root on none
> >> Nov 28 14:22:40 lechz1 su: pam_unix2: session started for user lzgneu,
> >> service su
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/resource.d/telematx.start.communication start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
> >> /etc/ha.d/resource.d/telematx.start.communication start
> >> Nov 28 14:22:40 lechz1 su: (to lzgneu) root on none
> >> Nov 28 14:22:40 lechz1 su: pam_unix2: session started for user lzgneu,
> >> service su
> >> Nov 28 14:22:40 lechz1 heartbeat: debug:
> >> /etc/ha.d/resource.d/telematx.start.communication start done. RC=0
> >>
> >> As you can see, telematx.start.communication starts twice in the same
> >> second. Where should I look for a configuration error?
> >
> > There is none :)
>
> No! The application was indeed running two times, which is really bad!

Yes, you're right. I missed one. It seems to be related to
ip-request-resp. Strange.

> > I really don't know why, but there are two
> > messages from heartbeat: one at the info level and the other at
> > the debug level (take a closer look). At any rate, resource
> > agents should be able to handle starts in the running status.
>
> Should be able. :-(

That's a requirement. See
http://www.linux-ha.org/HeartbeatResourceAgent

Thanks,

Dejan

>
> Regards,
> Burkhard
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
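
P.S. To illustrate the requirement that a resource agent must tolerate a start while it is already running, here is a minimal sketch in Python. It is not how telematx.start.communication or any script shipped with heartbeat works. The class name, the pidfile approach, and the `launch` callable are all assumptions made up for the example. The idea is only that `start` checks the status first and reports success without launching a second copy.

```python
import os


class ResourceAgent:
    """Hypothetical heartbeat-style resource script with an idempotent start.

    The class name, the pidfile approach, and the `launch` callable are
    illustrative assumptions, not part of heartbeat itself.
    """

    def __init__(self, pidfile, launch):
        self.pidfile = pidfile  # file that records the PID of the running application
        self.launch = launch    # callable that starts the application and returns its PID

    def status(self):
        """Return True if the pidfile names a process that still exists."""
        try:
            with open(self.pidfile) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            return False  # no pidfile or unreadable contents: treat as stopped
        try:
            os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
        except ProcessLookupError:
            return False  # stale pidfile: the process is gone
        except PermissionError:
            return True   # the process exists but belongs to another user
        return True

    def start(self):
        """Start the application unless it is already running; return 0 either way."""
        if self.status():
            return 0  # already running: report success without launching again
        pid = self.launch()
        with open(self.pidfile, "w") as f:
            f.write(str(pid))
        return 0
```

With something along these lines, a second start request like the one in the log above would find the application running and return 0 without starting it again. Note that the check followed by the launch is not atomic. In the log the two starts appear to run one after the other, but if they could truly overlap, the script would additionally need a lock around `start`.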
