Hi,

On Fri, Nov 30, 2007 at 09:08:50AM +0100, Burkhard Schultheis wrote:
> > From: Dejan Muhamedagic <[EMAIL PROTECTED]>
> > Subject: Re: [Linux-HA] Takeover: Application starts twice
> > 
> > On Thu, Nov 29, 2007 at 03:58:33PM +0100, Burkhard Schultheis wrote:
> >> We have an old installation of heartbeat running on SuSE 9.0. heartbeat
> >> version is 1.2.3.
> >>
> >> Normal start is OK. But we tested a takeover. We shut down the active
> >> node. Then the application was started twice on the second node.
> >>
> >> In messages I found this:
> >>
> >> Nov 28 14:22:37 lechz1 ipfail[1456]: debug: Other side is unstable.
> >> Nov 28 14:22:39 lechz1 heartbeat[1424]: info: Received shutdown notice
> >> from 'lechz2'.
> >> Nov 28 14:22:39 lechz1 heartbeat[1424]: info: Resources being acquired
> >> from lechz2.
> >> Nov 28 14:22:39 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
> >> child count 1
> >> Nov 28 14:22:39 lechz1 heartbeat[1460]: info: acquire all HA resources
> >> (standby).
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Acquiring resource group: lechz1
> >> 192.168.7.199 telematx.start.communication
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
> >> child count 2
> >> Nov 28 14:22:40 lechz1 heartbeat[1461]: info: Local Resource acquisition
> >> completed.
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: debug: StartNextRemoteRscReq():
> >> child count 1
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/resource.d/IPaddr 192.168.7.199 start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
> >> /etc/ha.d/resource.d/IPaddr 192.168.7.199 start
> >> Nov 28 14:22:40 lechz1 heartbeat: info: /home/lzgneu/bin/ifconfig eth0:0
> >> 192.168.7.199 netmask 255.255.255.0        broadcast 192.168.7.255
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Sending Gratuitous Arp for
> >> 192.168.7.199 on eth0:0 [eth0]
> >> Nov 28 14:22:40 lechz1 heartbeat: /usr/lib/heartbeat/send_arp -i 1010 -r
> >> 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.7.199 eth0
> >> 192.168.7.199 auto 192.168.7.199 ffffffffffff
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: /etc/ha.d/resource.d/IPaddr
> >> 192.168.7.199 start done. RC=0
> >> Nov 28 14:22:40 lechz1 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/resource.d/telematx.start.communication  start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
> >> /etc/ha.d/resource.d/telematx.start.communication  start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug:
> >> /etc/ha.d/resource.d/telematx.start.communication  start done. RC=0
> >> Nov 28 14:22:40 lechz1 heartbeat[1460]: info: all HA resource
> >> acquisition completed (standby).
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: info: Standby resource
> >> acquisition done [all].
> >> Nov 28 14:22:40 lechz1 heartbeat[1659]: debug: notify_world: setting
> >> SIGCHLD Handler to SIG_DFL
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running /etc/ha.d/rc.d/status 
> >> status
> >> Nov 28 14:22:40 lechz1 heartbeat: info: /usr/lib/heartbeat/mach_down:
> >> nice_failback: foreign resources acquired
> >> Nov 28 14:22:40 lechz1 heartbeat[1424]: info: mach_down takeover complete.
> >> Nov 28 14:22:40 lechz1 heartbeat: info: mach_down takeover complete for
> >> node lechz2.
> >> Nov 28 14:22:40 lechz1 heartbeat[1682]: debug: notify_world: setting
> >> SIGCHLD Handler to SIG_DFL
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> >> Nov 28 14:22:40 lechz1 heartbeat: received ip-request-resp 192.168.7.199
> >> OK yes
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Acquiring resource group: lechz1
> >> 192.168.7.199 telematx.start.communication
> >> Nov 28 14:22:40 lechz1 su: (to lzgneu) root on none
> >> Nov 28 14:22:40 lechz1 su: pam_unix2: session started for user lzgneu,
> >> service su
> >> Nov 28 14:22:40 lechz1 heartbeat: info: Running
> >> /etc/ha.d/resource.d/telematx.start.communication  start
> >> Nov 28 14:22:40 lechz1 heartbeat: debug: Starting
> >> /etc/ha.d/resource.d/telematx.start.communication  start
> >> Nov 28 14:22:40 lechz1 su: (to lzgneu) root on none
> >> Nov 28 14:22:40 lechz1 su: pam_unix2: session started for user lzgneu,
> >> service su
> >> Nov 28 14:22:40 lechz1 heartbeat: debug:
> >> /etc/ha.d/resource.d/telematx.start.communication  start done. RC=0
> >>
> >> As you can see, telematx.start.communication starts twice in the same
> >> second. Where should I look for a configuration error?
> > 
> > There is none :)
> 
> No! The application was indeed running two times, which is really bad!

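Yes, you're right. I missed the second start. It seems to be
triggered by ip-request-resp: right after "received ip-request-resp
192.168.7.199 OK yes" the resource group is acquired a second time.
Strange.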

> >  I really don't know why, but there are two
> > messages from heartbeat: one at the info level and the other at
> > the debug level (take a closer look). At any rate, resource
> > agents should be able to handle starts in the running status.
> 
> Should be able. :-(

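That's a requirement: a resource agent must return success, without
starting a second copy, when it is asked to start a resource that is
already running. See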

http://www.linux-ha.org/HeartbeatResourceAgent
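
In case it helps, here is a rough sketch of what a start-safe resource
script could look like. It is not your telematx.start.communication
script: PIDFILE, APP_CMD and the function names are placeholders I made
up for illustration, and I am assuming the application can be tracked
through a pid file. It only guards against a repeated start that comes
after the first one has returned, as in your log; truly concurrent
starts would need an additional lock.

```shell
#!/bin/sh
# Hypothetical sketch of an idempotent heartbeat v1 resource script.
# PIDFILE and APP_CMD are illustrative placeholders, not real settings.

PIDFILE="${PIDFILE:-/tmp/demo-app.pid}"
APP_CMD="${APP_CMD:-sleep 300}"

# Succeeds only if the pid file names a live process.
is_running() {
    [ -f "$PIDFILE" ] || return 1
    pid=$(cat "$PIDFILE" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

app_start() {
    if is_running; then
        # A repeated "start" is a no-op and still reports success.
        return 0
    fi
    $APP_CMD >/dev/null 2>&1 &
    echo $! > "$PIDFILE"
}

app_stop() {
    if is_running; then
        kill "$(cat "$PIDFILE")" 2>/dev/null
    fi
    rm -f "$PIDFILE"
    # Stopping an already stopped resource is also a success.
    return 0
}

app_status() {
    if is_running; then
        echo "running"
    else
        echo "stopped"
    fi
}

# Dispatch only when called with an argument, so the functions can
# also be sourced on their own.
if [ "$#" -gt 0 ]; then
    case "$1" in
        start)  app_start ;;
        stop)   app_stop ;;
        status) app_status ;;
        *)      echo "usage: $0 {start|stop|status}" >&2; exit 1 ;;
    esac
fi
```

With something like this, the second "start" in your log would find the
pid file, see a live process, and return 0 without launching the
application again.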

Thanks,

Dejan

> 
> Regards,
> Burkhard
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems