Hi,

I'm curious how the "on-fail" attribute of a recurring monitor
operation works.  From my testing, it seems that a recurring monitor
is considered to have failed any time its return code doesn't match
what the cluster expects.  That is, if the resource is supposed to be
running, and the monitor returns anything other than OCF_SUCCESS, then
the on-fail action will be taken.  Is my understanding of this
correct?

If so, is it possible to have different fail actions depending on the
sort of failure?  Specifically, I'm looking to do an on-fail="restart"
if the RA comes back with OCF_NOT_RUNNING when Pacemaker believes that
the resource should be running.  However, I'd like on-fail="ignore" if
the recurring monitor operation comes back with OCF_ERR_GENERIC, or
times out.

To explain -- I'm working with a resource that is adjusted by RPCs
(still those blasted AWS elastic IPs).  On occasion, the external API
may fail, which will be surfaced to Pacemaker as an OCF_ERR_GENERIC or
a timeout.  A transient API failure isn't itself a cause for alarm --
the cluster should simply assume that it has the correct view of the
universe until the API becomes available again.  However, if the
external API indicates with certainty that the resource is down when
Pacemaker believes it should be up, then we should take corrective
action immediately.
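
To make the policy I have in mind concrete, here is a rough sketch in
Python.  This is only an illustration of the behaviour I'd like, not
anything Pacemaker provides as far as I know.  The function name
desired_action and the TIMED_OUT marker are made up for this example;
the numeric values are the standard OCF return codes.  The final
fallback to "restart" for any other code is my own assumption.

```python
# Standard OCF return codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# Hypothetical marker for a monitor that timed out (not an OCF code).
TIMED_OUT = "timeout"


def desired_action(monitor_result, expected=OCF_SUCCESS):
    """Return the on-fail behaviour I would like for one monitor result."""
    if monitor_result == expected:
        # The monitor agrees with the cluster's view: nothing to do.
        return "none"
    if monitor_result == OCF_NOT_RUNNING:
        # The API says with certainty that the resource is down.
        return "restart"
    if monitor_result in (OCF_ERR_GENERIC, TIMED_OUT):
        # Transient API trouble: keep the cluster's current view.
        return "ignore"
    # Assumption: treat any other code as a real failure.
    return "restart"


for result in (OCF_SUCCESS, OCF_NOT_RUNNING, OCF_ERR_GENERIC, TIMED_OUT):
    print(result, "->", desired_action(result))
```

In other words, the action would be chosen per return code rather than
once for the whole monitor operation.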


Thanks,


Andrew

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
