Hi, I'm curious how the "on-fail" attribute of a recurring monitor operation works. From my testing, it seems that a recurring monitor is considered to have failed any time its return doesn't match what the cluster believes it should be. That is, if the resource is supposed to be running, and the monitor returns with anything other than OCF_SUCCESS, then the on-fail action will be taken. Is my understanding of this correct?
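To make that concrete, here is a small Python model of the behaviour as I understand it from testing. It is only a sketch of my mental model, not Pacemaker's actual logic; `monitor_failed` is a name I made up, and the numeric values are the standard OCF return codes.

```python
# Standard OCF return codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_failed(expected_rc, actual_rc):
    """Hypothetical helper: a recurring monitor counts as failed
    whenever its return code differs from what the cluster expects."""
    return actual_rc != expected_rc


# The resource is supposed to be running, so the cluster expects OCF_SUCCESS.
for rc in (OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING):
    print(rc, "failed" if monitor_failed(OCF_SUCCESS, rc) else "ok")
```

In this model, OCF_ERR_GENERIC and OCF_NOT_RUNNING are treated identically, so both would trigger the same on-fail action.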
If so, is it possible to have different fail actions depending on the sort of failure? Specifically, I'd like on-fail="restart" if the RA returns OCF_NOT_RUNNING when Pacemaker believes the resource should be running, but on-fail="ignore" if the recurring monitor operation returns OCF_ERR_GENERIC or times out.

To explain -- I'm working with a resource that is adjusted by RPCs (still those blasted AWS elastic IPs). On occasion the external API may fail, which is surfaced to Pacemaker as an OCF_ERR_GENERIC or a timeout. A transient API failure isn't itself a cause for alarm -- the cluster should simply assume that it has the correct view of the universe until the API becomes available again. However, if the external API indicates with certainty that the resource is down when Pacemaker believes it should be up, then we should take corrective action immediately.

Thanks,
Andrew
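P.S. In case it helps, here is the policy I'm after as a small Python sketch. It only illustrates the mapping I want, not an existing Pacemaker feature or configuration; `desired_action` and `TIMED_OUT` are names I made up, and the numeric values are the standard OCF return codes.

```python
# Standard OCF return codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# Placeholder for "the monitor operation timed out"; not an OCF code.
TIMED_OUT = "timeout"


def desired_action(result):
    """Hypothetical mapping from a monitor result to the on-fail
    behaviour I would like, for a resource that should be running."""
    if result == OCF_SUCCESS:
        return "none"      # monitor agrees with the cluster's view
    if result == OCF_NOT_RUNNING:
        return "restart"   # the API says with certainty that it is down
    if result in (OCF_ERR_GENERIC, TIMED_OUT):
        return "ignore"    # transient API trouble; keep the current view
    return "unspecified"   # other results are outside this question


for result in (OCF_SUCCESS, OCF_NOT_RUNNING, OCF_ERR_GENERIC, TIMED_OUT):
    print(result, desired_action(result))
```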
