Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On 08/27/2015 03:04 AM, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
>
>> 24.08.2015 13:32, Tom Yates wrote:
>>> if i understand you aright, my problem is that the stop script didn't
>>> return a 0 (OK) exit status, so CRM didn't know where to go. is the
>>> exit status of the stop script how CRM determines the status of the
>>> stop operation?
>>
>> correct
>>
>>> does CRM also use the output of "/etc/init.d/script status" to
>>> determine continuing successful operation?
>>
>> It definitely does not use *output* of script - only return code. If
>> the question is whether it probes resource additionally to checking
>> stop exit code - I do not think so (I know it does it in some cases
>> for systemd resources).
>
> i just thought i'd come back and follow-up. in testing this morning, i
> can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
> running. that makes a standard init.d script, which passes on the
> return code of the stop command, unhelpful to CRM.
>
> i changed the script so that on stop, having run pppoe-stop, it checks
> for the existence of a working ppp0 interface, and returns 0 iff there
> is none.

Nice

>> If resource was previously active and stop was attempted as cleanup
>> after resource failure - yes, it should attempt to start it again.
>
> that is now what happens. it seems to try three times to bring up pppd,
> then kicks the service over to the other node.
>
> in the case of extended outages (ie, the ISP goes away for more than
> about 10 minutes), where both nodes have time to fail, we end up back in
> the bad old state (service failed on both nodes):
>
> [root@positron ~]# crm status
> [...]
> Online: [ electron positron ]
>
>  Resource Group: BothIPs
>      InternalIP (ocf::heartbeat:IPaddr): Started electron
>      ExternalIP (lsb:hb-adsl-helper): Stopped
>
> Failed actions:
>     ExternalIP_monitor_6 (node=positron, call=15, rc=7, status=complete): not running
>     ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
>     ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error
>
> is there any way to configure CRM to keep kicking the service between
> the two nodes forever (ie, try three times on positron, kick service
> group to electron, try three times on electron, kick back to positron,
> lather rinse repeat...)?
>
> for a service like DSL, which can go away for extended periods through
> no local fault then suddenly and with no announcement come back, this
> would be most useful behaviour.

Yes, see migration-threshold and failure-timeout.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options

> thanks to all for help with this. thanks also to those who have
> suggested i rewrite this as an OCF agent (especially to ken gaillot who
> was kind enough to point me to documentation); i will look at that if
> time permits.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
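Editorial sketch, not part of the original message: one way the two options named in the reply above might be attached to the resource from the thread, in crm shell syntax. The values 3 and 120s are illustrative assumptions, not recommendations from the list; only the option names and the primitive definition come from the thread.

```text
primitive ExternalIP lsb:hb-adsl-helper \
        meta migration-threshold="3" failure-timeout="120s" \
        op monitor interval="60s"
```

With settings along these lines, a node should stop being eligible after three failures and become eligible again once the failures are older than failure-timeout, which is roughly the back-and-forth behaviour asked about. Two things are worth checking against the Pacemaker documentation for the version in use: expired failures are only re-evaluated when the cluster next recomputes its state (governed by cluster-recheck-interval), and the start-failure-is-fatal cluster property decides whether a single failed start counts as reaching the threshold immediately.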
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On Mon, 24 Aug 2015, Andrei Borzenkov wrote:

> 24.08.2015 13:32, Tom Yates wrote:
>> if i understand you aright, my problem is that the stop script didn't
>> return a 0 (OK) exit status, so CRM didn't know where to go. is the
>> exit status of the stop script how CRM determines the status of the
>> stop operation?
>
> correct
>
>> does CRM also use the output of "/etc/init.d/script status" to
>> determine continuing successful operation?
>
> It definitely does not use *output* of script - only return code. If
> the question is whether it probes resource additionally to checking
> stop exit code - I do not think so (I know it does it in some cases
> for systemd resources).

i just thought i'd come back and follow-up. in testing this morning, i
can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
running. that makes a standard init.d script, which passes on the
return code of the stop command, unhelpful to CRM.

i changed the script so that on stop, having run pppoe-stop, it checks
for the existence of a working ppp0 interface, and returns 0 iff there
is none.

> If resource was previously active and stop was attempted as cleanup
> after resource failure - yes, it should attempt to start it again.

that is now what happens. it seems to try three times to bring up pppd,
then kicks the service over to the other node.

in the case of extended outages (ie, the ISP goes away for more than
about 10 minutes), where both nodes have time to fail, we end up back in
the bad old state (service failed on both nodes):

[root@positron ~]# crm status
[...]
Online: [ electron positron ]

 Resource Group: BothIPs
     InternalIP (ocf::heartbeat:IPaddr): Started electron
     ExternalIP (lsb:hb-adsl-helper): Stopped

Failed actions:
    ExternalIP_monitor_6 (node=positron, call=15, rc=7, status=complete): not running
    ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
    ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error

is there any way to configure CRM to keep kicking the service between
the two nodes forever (ie, try three times on positron, kick service
group to electron, try three times on electron, kick back to positron,
lather rinse repeat...)?

for a service like DSL, which can go away for extended periods through
no local fault then suddenly and with no announcement come back, this
would be most useful behaviour.

thanks to all for help with this. thanks also to those who have
suggested i rewrite this as an OCF agent (especially to ken gaillot who
was kind enough to point me to documentation); i will look at that if
time permits.

-- 
Tom Yates - http://www.teaparty.net
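Editorial sketch, not part of the original message: the modified script was not posted, so the following is only a guess at the kind of stop action described above. It runs pppoe-stop, ignores its exit status, and reports success exactly when ppp0 is gone. The variable IFACE and the helper names run_pppoe_stop and iface_present are hypothetical; the interface test reuses the ifconfig check from the original status action.

```shell
#!/bin/bash
# Sketch of a stop action that reports success whenever the link is down.
# IFACE, run_pppoe_stop and iface_present are illustrative names only.
IFACE="${IFACE:-ppp0}"

# Run the stop helper and deliberately discard its exit status, because
# pppoe-stop returns 1 when pppd is already gone.
run_pppoe_stop() {
    /sbin/pppoe-stop >/dev/null 2>&1
    return 0
}

# Succeeds if the interface still exists.
iface_present() {
    /sbin/ifconfig "$IFACE" >/dev/null 2>&1
}

stop() {
    run_pppoe_stop
    if iface_present; then
        return 1    # interface still there: a genuine stop failure
    fi
    return 0        # interface gone, or it was never up
}
```

In an init script this function would be called from the stop) branch of the case statement, with the script exiting on its return value. Stopping something that is already stopped then counts as success, which is what the cluster needs in order to move on to a restart or a failover.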
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On 08/24/2015 04:52 AM, Andrei Borzenkov wrote:
> 24.08.2015 12:35, Tom Yates wrote:
>> I've got a failover firewall pair where the external interface is ADSL;
>> that is, PPPoE. i've defined the service thus:
>>
>> primitive ExternalIP lsb:hb-adsl-helper \
>>          op monitor interval="60s"
>>
>> and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:
>>
>> #!/bin/bash
>> RETVAL=0
>> start() {
>>      /sbin/pppoe-start
>> }
>> stop() {
>>      /sbin/pppoe-stop
>> }
>> case "$1" in
>>    start)
>>      start
>>      ;;
>>    stop)
>>      stop
>>      ;;
>>    status)
>>      /sbin/ifconfig ppp0 >& /dev/null && exit 0
>>      exit 1
>>      ;;
>>    *)
>>      echo $"Usage: $0 {start|stop|status}"
>>      exit 3
>> esac
>> exit $?

Pacemaker expects that LSB agents follow the LSB spec for return codes,
and won't be able to behave properly if they don't:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb

However it's just as easy to write an OCF agent, which gives you more
flexibility (accepting parameters, etc.):

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf

>> The problem is that sometimes the ADSL connection falls over, as they
>> do, eg:
>>
>> Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
>> Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
>> Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
>> Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
>> Aug 20 11:42:13 positron pppd[2469]: Modem hangup
>> Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
>> Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
>> Aug 20 11:42:13 positron pppd[2469]: Exit.
>> Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.
>>
>> CRMd then logs a bunch of stuff, followed by
>>
>> Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
>> Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
>> [...]
>> Aug 20 11:42:18 positron pppoe-stop: Killing pppd
>> Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
>> Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.
>>
>> At this point, the PPPoE connection is down, and stays down. CRMd
>> doesn't fail the group which contains both internal and external
>> interfaces over to the other node, but nor does it try to restart the
>> service. I'm fairly sure this is because I've done something
>> boneheaded, but I can't get my bone head around what it might be.
>>
>> Any light anyone can shed is much appreciated.
>
> If stop operation failed resource state is undefined; pacemaker won't do
> anything with this resource. Either make sure script returns success
> when appropriate or the only option is to make it fence node where
> resource was active.
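Editorial sketch, not part of the original message: the LSB appendix linked above describes checking an init script by hand, running start, status, start again, stop, status and stop again, and comparing the exit codes. The helper below automates that sequence under stated assumptions. The function names expect and check_lsb are hypothetical; the expected codes (0 for every start and stop, 0 for status while running, 3 for status while stopped) follow the LSB convention. It really does start and stop the service, so it is best run on a node where that is harmless.

```shell
#!/bin/bash
# expect WANT SCRIPT ACTION: run one action and compare its exit code.
# expect and check_lsb are illustrative helper names.
expect() {
    want="$1"; script="$2"; action="$3"
    got=0
    "$script" "$action" >/dev/null 2>&1 || got=$?
    if [ "$got" -eq "$want" ]; then
        echo "ok   $action -> $got"
    else
        echo "FAIL $action -> $got (expected $want)"
        return 1
    fi
}

# check_lsb SCRIPT: walk the script through the LSB sequence and
# return non-zero if any step gave an unexpected exit code.
check_lsb() {
    script="$1"
    failures=0
    expect 0 "$script" start  || failures=$((failures + 1))
    expect 0 "$script" status || failures=$((failures + 1))
    expect 0 "$script" start  || failures=$((failures + 1))  # start while running
    expect 0 "$script" stop   || failures=$((failures + 1))
    expect 3 "$script" status || failures=$((failures + 1))  # stopped means 3
    expect 0 "$script" stop   || failures=$((failures + 1))  # stop while stopped
    [ "$failures" -eq 0 ]
}
```

Called as check_lsb /etc/init.d/hb-adsl-helper, the script quoted above would be expected to fail the last two steps: its status action exits 1 rather than 3 when ppp0 is absent, and its stop action passes on the exit status of pppoe-stop, which the thread later reports as 1 when pppd is not running.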
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
24.08.2015 13:32, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
>
>> 24.08.2015 12:35, Tom Yates wrote:
>>> I've got a failover firewall pair where the external interface is ADSL;
>>> that is, PPPoE. i've defined the service thus:
>>
>> If stop operation failed resource state is undefined; pacemaker won't do
>> anything with this resource. Either make sure script returns success
>> when appropriate or the only option is to make it fence node where
>> resource was active.
>
> andrei, thank you for your prompt and helpful response.
>
> if i understand you aright, my problem is that the stop script didn't
> return a 0 (OK) exit status, so CRM didn't know where to go. is the
> exit status of the stop script how CRM determines the status of the
> stop operation?

correct

> and if that gives exit code 0, it will then try to do a
> "/etc/init.d/script start"?

If resource was previously active and stop was attempted as cleanup
after resource failure - yes, it should attempt to start it again.

> does CRM also use the output of "/etc/init.d/script status" to
> determine continuing successful operation?

It definitely does not use *output* of script - only return code. If
the question is whether it probes resource additionally to checking
stop exit code - I do not think so (I know it does it in some cases
for systemd resources).
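Editorial sketch, not part of the original message: since only the return code matters, a status action can stay silent and put all of its information into the exit status. The version below assumes the same ifconfig-based check as the script in this thread and uses the LSB status codes: 0 for running, 3 for not running. IFACE and iface_present are hypothetical names.

```shell
#!/bin/bash
# Status action that communicates purely through its exit code.
# IFACE and iface_present are illustrative names only.
IFACE="${IFACE:-ppp0}"

# Succeeds if the interface exists; prints nothing either way.
iface_present() {
    /sbin/ifconfig "$IFACE" >/dev/null 2>&1
}

status() {
    if iface_present; then
        return 0    # LSB: service is running
    fi
    return 3        # LSB: service is not running
}
```

The script posted at the start of the thread exits 1 in the stopped case, which LSB reserves for a dead program whose pid file still exists; 3 is the code that says the service is cleanly stopped.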
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On Mon, 24 Aug 2015, Andrei Borzenkov wrote:

> 24.08.2015 12:35, Tom Yates wrote:
>> I've got a failover firewall pair where the external interface is ADSL;
>> that is, PPPoE. i've defined the service thus:
>
> If stop operation failed resource state is undefined; pacemaker won't do
> anything with this resource. Either make sure script returns success
> when appropriate or the only option is to make it fence node where
> resource was active.

andrei, thank you for your prompt and helpful response.

if i understand you aright, my problem is that the stop script didn't
return a 0 (OK) exit status, so CRM didn't know where to go. is the
exit status of the stop script how CRM determines the status of the
stop operation?

and if that gives exit code 0, it will then try to do a
"/etc/init.d/script start"?

does CRM also use the output of "/etc/init.d/script status" to
determine continuing successful operation?

-- 
Tom Yates - http://www.teaparty.net
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
24.08.2015 12:35, Tom Yates wrote:
> I've got a failover firewall pair where the external interface is ADSL;
> that is, PPPoE. i've defined the service thus:
>
> primitive ExternalIP lsb:hb-adsl-helper \
>          op monitor interval="60s"
>
> and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:
>
> #!/bin/bash
> RETVAL=0
> start() {
>      /sbin/pppoe-start
> }
> stop() {
>      /sbin/pppoe-stop
> }
> case "$1" in
>    start)
>      start
>      ;;
>    stop)
>      stop
>      ;;
>    status)
>      /sbin/ifconfig ppp0 >& /dev/null && exit 0
>      exit 1
>      ;;
>    *)
>      echo $"Usage: $0 {start|stop|status}"
>      exit 3
> esac
> exit $?
>
> The problem is that sometimes the ADSL connection falls over, as they
> do, eg:
>
> Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
> Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
> Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
> Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
> Aug 20 11:42:13 positron pppd[2469]: Modem hangup
> Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
> Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
> Aug 20 11:42:13 positron pppd[2469]: Exit.
> Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.
>
> CRMd then logs a bunch of stuff, followed by
>
> Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
> Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
> [...]
> Aug 20 11:42:18 positron pppoe-stop: Killing pppd
> Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
> Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.
>
> At this point, the PPPoE connection is down, and stays down. CRMd
> doesn't fail the group which contains both internal and external
> interfaces over to the other node, but nor does it try to restart the
> service. I'm fairly sure this is because I've done something
> boneheaded, but I can't get my bone head around what it might be.
>
> Any light anyone can shed is much appreciated.

If stop operation failed resource state is undefined; pacemaker won't do
anything with this resource. Either make sure script returns success
when appropriate or the only option is to make it fence node where
resource was active.
[ClusterLabs] CRM managing ADSL connection; failure not handled
I've got a failover firewall pair where the external interface is ADSL;
that is, PPPoE. i've defined the service thus:

primitive ExternalIP lsb:hb-adsl-helper \
         op monitor interval="60s"

and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

#!/bin/bash
RETVAL=0
start() {
     /sbin/pppoe-start
}
stop() {
     /sbin/pppoe-stop
}
case "$1" in
   start)
     start
     ;;
   stop)
     stop
     ;;
   status)
     /sbin/ifconfig ppp0 >& /dev/null && exit 0
     exit 1
     ;;
   *)
     echo $"Usage: $0 {start|stop|status}"
     exit 3
esac
exit $?

The problem is that sometimes the ADSL connection falls over, as they
do, eg:

Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
Aug 20 11:42:13 positron pppd[2469]: Modem hangup
Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
Aug 20 11:42:13 positron pppd[2469]: Exit.
Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.

CRMd then logs a bunch of stuff, followed by

Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
[...]
Aug 20 11:42:18 positron pppoe-stop: Killing pppd
Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.

At this point, the PPPoE connection is down, and stays down. CRMd
doesn't fail the group which contains both internal and external
interfaces over to the other node, but nor does it try to restart the
service. I'm fairly sure this is because I've done something
boneheaded, but I can't get my bone head around what it might be.

Any light anyone can shed is much appreciated.

-- 
Tom Yates - http://www.teaparty.net