Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
On 01/26/2016 05:06 AM, Joakim Hansson wrote:
> Thanks for the help guys.
> I ended up patching together my own RA from the Delay and Dummy RA's and
> using curl to request the header of solr's ping request handler on
> localhost, which made the resource start return a bit more dynamic.
> However, now I have another problem which I don't think is related to my RA.
> For some reason when failing over the nodes, the ClusterIP (vIP below)
> seems to avoid the node running the fencing agent:
>
> pcs status
>
> Online: [ node01 node02 ]
> OFFLINE: [ node03 ]
>
> Full list of resources:
>
>  VMWare-fence   (stonith:fence_vmware_soap):    Started node02
>  Clone Set: dlm-clone [dlm]
>      Started: [ node01 node02 ]
>      Stopped: [ node03 ]
>  Clone Set: GFS2-clone [GFS2] (unique)
>      GFS2:0     (ocf::heartbeat:Filesystem):    Started node01
>      GFS2:1     (ocf::heartbeat:Filesystem):    Stopped
>      GFS2:2     (ocf::heartbeat:Filesystem):    Started node02
>  Clone Set: Tomcat-clone [Tomcat]
>      Started: [ node02 ]
>      Stopped: [ node01 node03 ]
>  vIP    (ocf::heartbeat:IPaddr2):       Stopped
>
> Notice how the tomcat-clone is started on node02 but the vIP remains
> stopped.
> If I start the fence agent on any of the other nodes the same thing happens
> (ie, vIP avoiding the fencing node)
> Any idea why this happens?
>
> Output of 'pcs config show':
> https://github.com/apepojken/pacemaker/blob/master/Config

I notice you have multiple ordering constraints but only one colocation
constraint. That means, for example, that tomcat-clone must be started
after GFS2, but it does not have to be on the same node. I'm pretty sure
you want colocation constraints as well, to make them start on the same
node.

FYI, a group is like a shorthand for ordering and colocation constraints
for multiple resources that need to be kept together and started/stopped
in order.

I also see you have globally-unique=true on GFS2-clone. You probably do
not want this.

globally-unique=false (the default) is more common, and means that all
clone instances are interchangeable; it is usually configured with
clone-node-max=1, because only one instance is ever needed on any one
node.

globally-unique=true means that each clone instance handles a different
subset of requests; it is usually configured with clone-node-max > 1 so
that multiple clone instances can run on a single node if needed.

I don't see from that alone why vIP wouldn't start, but take care of the
above issues first, and see what the behavior is then.

> Thanks again!

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
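A minimal sketch of the colocation and clone advice in the reply above, using the resource names from the quoted pcs status output (dlm-clone, GFS2-clone, Tomcat-clone, vIP). The linked config is not reproduced in this thread, so which pairs to colocate and the INFINITY scores are assumptions; run "pcs constraint" first and skip any pair that is already there. pcs_cmd is a hypothetical helper that only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints each pcs command unless APPLY=1 is set.
pcs_cmd() {
    if [ "${APPLY:-0}" = "1" ]; then
        pcs "$@"
    else
        echo "pcs $*"
    fi
}

# Assumed pairs: keep each layer on a node where the layer below it runs.
colocate_stack() {
    pcs_cmd constraint colocation add GFS2-clone with dlm-clone INFINITY
    pcs_cmd constraint colocation add Tomcat-clone with GFS2-clone INFINITY
    pcs_cmd constraint colocation add vIP with Tomcat-clone INFINITY
}

# Make the GFS2 clone instances interchangeable, at most one per node.
anonymize_gfs2_clone() {
    pcs_cmd resource meta GFS2-clone globally-unique=false clone-node-max=1
}

colocate_stack
anonymize_gfs2_clone
```

Run as-is, the script only echoes the four commands so they can be reviewed; APPLY=1 hands them to pcs on a cluster node.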
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
Joakim Hansson writes:

> Notice how the tomcat-clone is started on node02 but the vIP remains
> stopped.
> If I start the fence agent on any of the other nodes the same thing happens
> (ie, vIP avoiding the fencing node)
> Any idea why this happens?

Hi,

I'm not 100% sure this is what is causing the behaviour that you are
seeing, but by default Pacemaker tries to balance the resources evenly
across the cluster. You can influence this using placement scores on
location constraints.

--
// Kristoffer Grönlund
// kgronl...@suse.com
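As a rough illustration of a scored location constraint, here is a sketch using pcs. The choice of node02 and the score of 50 are arbitrary examples, not values from the thread, and pcs_cmd is a hypothetical helper that prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints each pcs command unless APPLY=1 is set.
pcs_cmd() {
    if [ "${APPLY:-0}" = "1" ]; then
        pcs "$@"
    else
        echo "pcs $*"
    fi
}

# Give a resource a preference for one node.
# A finite score is a preference only; the resource may still run elsewhere.
prefer_node() {
    resource="$1"
    node="$2"
    score="${3:-50}"   # example default, chosen arbitrarily
    pcs_cmd constraint location "$resource" prefers "${node}=${score}"
}

prefer_node vIP node02 50
```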
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
Thanks for the help guys.
I ended up patching together my own RA from the Delay and Dummy RA's and
using curl to request the header of solr's ping request handler on
localhost, which made the resource start return a bit more dynamic.
However, now I have another problem which I don't think is related to my RA.
For some reason when failing over the nodes, the ClusterIP (vIP below)
seems to avoid the node running the fencing agent:

pcs status

Online: [ node01 node02 ]
OFFLINE: [ node03 ]

Full list of resources:

 VMWare-fence   (stonith:fence_vmware_soap):    Started node02
 Clone Set: dlm-clone [dlm]
     Started: [ node01 node02 ]
     Stopped: [ node03 ]
 Clone Set: GFS2-clone [GFS2] (unique)
     GFS2:0     (ocf::heartbeat:Filesystem):    Started node01
     GFS2:1     (ocf::heartbeat:Filesystem):    Stopped
     GFS2:2     (ocf::heartbeat:Filesystem):    Started node02
 Clone Set: Tomcat-clone [Tomcat]
     Started: [ node02 ]
     Stopped: [ node01 node03 ]
 vIP    (ocf::heartbeat:IPaddr2):       Stopped

Notice how the tomcat-clone is started on node02 but the vIP remains
stopped.
If I start the fence agent on any of the other nodes the same thing happens
(i.e., vIP avoiding the fencing node).
Any idea why this happens?

Output of 'pcs config show':
https://github.com/apepojken/pacemaker/blob/master/Config

Thanks again!
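The custom RA itself is not posted in the thread. The following is only a sketch of the kind of readiness check it describes: a curl HEAD request against solr's ping handler, polled until it succeeds. The URL, the function names, and the timings are assumptions for illustration.

```shell
#!/bin/sh
# Assumed ping URL; the real host, port, and core path are not given in the thread.
SOLR_PING_URL="${SOLR_PING_URL:-http://localhost:8080/solr/admin/ping}"

# Succeeds once a HEAD request to the ping handler returns HTTP 200.
# curl prints 000 as the status code when it cannot connect at all.
solr_is_ready() {
    code=$(curl --silent --head --output /dev/null \
        --write-out '%{http_code}' --max-time 5 "$SOLR_PING_URL")
    [ "$code" = "200" ]
}

# Polls solr_is_ready every $2 seconds, giving up after about $1 seconds.
# Returns 0 when solr answers, 1 when the wait budget is used up.
wait_for_solr() {
    max_wait="${1:-120}"
    interval="${2:-5}"
    waited=0
    while ! solr_is_ready; do
        if [ "$waited" -ge "$max_wait" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In an OCF agent the start action could call wait_for_solr after launching tomcat and return OCF_SUCCESS (0) or OCF_ERR_GENERIC (1) accordingly. The start operation's timeout in the cluster configuration then has to be longer than max_wait.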
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
On 14/01/16 14:46 +0100, Kristoffer Grönlund wrote:
> It looks like you have a monitor operation configured for the Delay
> resource, but you haven't set the mondelay parameter. But either way,
> there is no reason to monitor the Delay resource, so remove that. Same
> thing for the stop operation, just remove it.
>
> I'm guessing pcs adds these by default.

It's true that pcs adds the equivalent of "op monitor interval=60s"
as an unconditional fallback when defining a new resource.
Other operations are driven solely by explicit values or by
defaults for the particular resource, and this can be turned off
via the "--no-default-ops" option to pcs.

FWIW, this could be a way to have monitor explicitly deactivated:

  pcs resource create ... op monitor interval=0s

--
Jan (Poki)
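Putting the two suggestions together, a hedged sketch of what the full create command might look like. The resource name, startdelay=120, and start timeout=180 come from the command posted earlier in the thread. Setting mondelay=0 is an assumption about how to keep the agent's monitor action from sleeping, and pcs_cmd is a hypothetical helper that prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints each pcs command unless APPLY=1 is set.
pcs_cmd() {
    if [ "${APPLY:-0}" = "1" ]; then
        pcs "$@"
    else
        echo "pcs $*"
    fi
}

# Create the Delay resource with the recurring monitor deactivated.
create_delay() {
    start_delay="${1:-120}"
    # The start timeout must exceed the delay; 60 s of headroom matches
    # the 120/180 pair used in the thread.
    start_timeout=$((start_delay + 60))
    pcs_cmd resource create Delay ocf:heartbeat:Delay \
        startdelay="$start_delay" mondelay=0 \
        op start timeout="$start_timeout" monitor interval=0s
}

create_delay 120
```

The failures reported in the thread are Delay_monitor_0 operations timing out after 30 s, so it is worth checking the agent's parameter defaults with "pcs resource describe ocf:heartbeat:Delay" before relying on this.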
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
Joakim Hansson writes:

> When adding the Delay RA it starts throwing a bunch of errors and the
> cluster starts fencing the nodes one by one.
>
> The error's I get with "pcs status":
>
> Failed Actions:
> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out,
>   exit reason='none',
>   last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out,
>   exit reason='none',
>   last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out,
>   exit reason='none',
>   last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
>
> and in the /var/log/pacemaker.log:
>
> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
>
> I added the Delay RA with:
>
> pcs resource create Delay ocf:heartbeat:Delay \
>         startdelay="120" meta target-role=Started \
>         op start timeout="180"
>
> and my config looks like this:
>
> https://github.com/apepojken/pacemaker/blob/master/Config
>
> Am I missing something obvious here?

Hi,

It looks like you have a monitor operation configured for the Delay
resource, but you haven't set the mondelay parameter. But either way,
there is no reason to monitor the Delay resource, so remove that. Same
thing for the stop operation, just remove it.

I'm guessing pcs adds these by default.

Cheers,
Kristoffer

> Thanks again for all the help so far!

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
>> Hi,
>>
>> There is the ocf:heartbeat:Delay resource agent, which on one hand is
>> documented as a test resource, but on the other hand should do what you
>> need:
>>
>> primitive solr ...
>> primitive two-minute-delay ocf:heartbeat:Delay \
>>     params startdelay=120 meta target-role=Started \
>>     op start timeout=180
>> group solr-then-wait solr two-minute-delay
>>
>> Now the group acts basically like the solr resource, except for the
>> two-minute delay after starting solr before the group itself is
>> considered started.
>>
>> Cheers,
>> Kristoffer
>
> Another way would be to customize the tomcat resource agent so that
> start doesn't return success until it's fully ready to accept requests
> (which would probably be specific to whatever app you're running via
> tomcat). Of course you'd need a long start timeout.

Thanks for the tips guys!
I'm using the systemd RA of tomcat (I know it's not recommended) and
can't seem to figure out how to go about postponing the success return.
Maybe I'll try the OCF one later.

When adding the Delay RA it starts throwing a bunch of errors and the
cluster starts fencing the nodes one by one.

The errors I get with "pcs status":

Failed Actions:
* Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out,
  exit reason='none',
  last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
* Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out,
  exit reason='none',
  last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
* Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out,
  exit reason='none',
  last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms

and in the /var/log/pacemaker.log:

https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay

I added the Delay RA with:

pcs resource create Delay ocf:heartbeat:Delay \
        startdelay="120" meta target-role=Started \
        op start timeout="180"

and my config looks like this:

https://github.com/apepojken/pacemaker/blob/master/Config

Am I missing something obvious here?

Thanks again for all the help so far!
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
On 01/12/2016 07:57 AM, Kristoffer Grönlund wrote:
> Joakim Hansson writes:
>
>> Hi!
>> I have a cluster running tomcat which in turn run solr.
>> I use three nodes with loadbalancing via ipaddr2.
>> The thing is, when tomcat is started on a node it takes about 2 minutes
>> before solr is functioning correctly.
>>
>> Is there a way to make the ipaddr2-clone wait 2 minutes after tomcat is
>> started before it moves the ip to the node?
>>
>> Much appreciated!
>
> Hi,
>
> There is the ocf:heartbeat:Delay resource agent, which on one hand is
> documented as a test resource, but on the other hand should do what you
> need:
>
> primitive solr ...
> primitive two-minute-delay ocf:heartbeat:Delay \
>     params startdelay=120 meta target-role=Started \
>     op start timeout=180
> group solr-then-wait solr two-minute-delay
>
> Now the group acts basically like the solr resource, except for the
> two-minute delay after starting solr before the group itself is
> considered started.
>
> Cheers,
> Kristoffer
>
>>
>> / Jocke

Another way would be to customize the tomcat resource agent so that
start doesn't return success until it's fully ready to accept requests
(which would probably be specific to whatever app you're running via
tomcat). Of course you'd need a long start timeout.
Re: [ClusterLabs] Wait until resource is really ready before moving clusterip
Joakim Hansson writes:

> Hi!
> I have a cluster running tomcat which in turn run solr.
> I use three nodes with loadbalancing via ipaddr2.
> The thing is, when tomcat is started on a node it takes about 2 minutes
> before solr is functioning correctly.
>
> Is there a way to make the ipaddr2-clone wait 2 minutes after tomcat is
> started before it moves the ip to the node?
>
> Much appreciated!

Hi,

There is the ocf:heartbeat:Delay resource agent, which on one hand is
documented as a test resource, but on the other hand should do what you
need:

primitive solr ...
primitive two-minute-delay ocf:heartbeat:Delay \
    params startdelay=120 meta target-role=Started \
    op start timeout=180
group solr-then-wait solr two-minute-delay

Now the group acts basically like the solr resource, except for the
two-minute delay after starting solr before the group itself is
considered started.

Cheers,
Kristoffer

>
> / Jocke

--
// Kristoffer Grönlund
// kgronl...@suse.com