On Wed, 5 Aug 2015 16:37:39 +0300 Andrei Borzenkov <arvidj...@gmail.com> wrote:
> On Wed, Aug 5, 2015 at 4:04 PM, Jehan-Guillaume de Rorthais
> <j...@dalibo.com> wrote:
> > hi guys,
> >
> > We are still on our new postgresql resource agent.
> >
> > We kind of made up our minds about the promotion issue (see ml thread
> > "problem with master score limited to 1000000") and found an acceptable
> > algorithm.
> >
> > Now that we are testing this RA, I found a strange behavior of the CRM
> > with a simple failure scenario: The master resource is stopped.
> >
> > When I gracefully stop the master,
>
> You mean - stop postgres outside of pacemaker?

Yes, to simulate a resource failure.

> > the CRM tries to recover
> > the resource with:
> >
> >   * demote it
> >   * stop it
> >   * start it
> >   * promote it
> >
> > Sounds logical, but it fails at the first step because the master is
> > actually stopped. According to the "ra-dev-guide", the RA should return
> > OCF_ERR_GENERIC if the resource is stopped on demote. See:
> >
> > http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
> >
> > When teaching my RA to follow this, the CRM keeps trying the same
> > transition again and again until the failcount reaches the
> > migration-threshold. Then it stops trying to recover it and moves the
> > resource to another node.
> >
> > Same result if the RA returns OCF_NOT_RUNNING from the demote action
> > instead of OCF_ERR_GENERIC.
> >
> > I could try to obey the CRM and start the resource as a slave and
> > return OCF_SUCCESS, but it sounds ridiculous as it will be stopped at the
> > very next step, then started again one step later...
> >
> > Did I miss something? Is this behavior normal? Any advice to fix this?

-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

_______________________________________________
Developers mailing list
Developers@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/developers