> On Wed, Dec 29, 2010 at 03:04:18PM +0100, Alexander Krauth wrote:
> > # HG changeset patch
> > # User Alexander Krauth <[email protected]>
> > # Date 1293631454 -3600
> > # Node ID a1f4bf0db5ff8c7c2ebd02e413df5e15201d4a7c
> > # Parent  69cd9345a879e7764b4457834ded0093274d0322
> > High: SAPInstance: Fixed monitor_clone function to ensure enqueue failover, in case of process (not host) failure
> > 
> > RAs in versions <= 2.01 used a Heartbeat 2.0 specific feature to distinguish whether they were running in master or slave mode.
> > This does not work with Pacemaker anymore.
> > 
> > Since RA version 2.02 (not in an official release) the monitor_clone function has been broken for the case of a local failure of the Standalone Enqueue process.
> > 
> > This patch follows the requirement that the RA must know by itself whether it is running in master or slave mode.
> > It also ensures that the slave (Enqueue Replication Server) always gets promoted if the master (Standalone Enqueue Server) fails.
> > 
> > diff -r 69cd9345a879 -r a1f4bf0db5ff heartbeat/SAPInstance
> > --- a/heartbeat/SAPInstance   Wed Dec 29 14:40:41 2010 +0100
> > +++ b/heartbeat/SAPInstance   Wed Dec 29 15:04:14 2010 +0100
> > @@ -32,6 +32,10 @@
> >  #   OCF_RESKEY_PRE_STOP_USEREXIT   (optional, lists a script which can be executed before the resource is stopped)
> >  #   OCF_RESKEY_POST_STOP_USEREXIT   (optional, lists a script which can be executed after the resource is stopped)
> >  #
> > +#  TODO: - Option to shutdown sapstartsrv for non-active instances -> that means: do probes only with OS tools (sapinstance_status)
> > +#        - Option for better standalone enqueue server monitoring, using ensmon (test enque-deque)
> > +#        - Option for cleanup abandoned enqueue replication tables
> > +#
> >  #######################################################################
> >  # Initialization:
> > 
> > @@ -68,7 +72,7 @@
> >  <?xml version="1.0"?>
> >  <!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
> >  <resource-agent name="SAPInstance">
> > -<version>2.11</version>
> > +<version>2.12</version>
> > 
> >  <shortdesc lang="en">Manages a SAP instance as an HA resource.</shortdesc>
> >  <longdesc lang="en">
> > @@ -708,7 +712,7 @@
> >  #
> >  sapinstance_start_clone() {
> >    sapinstance_init $OCF_RESKEY_ERS_InstanceName
> > -  ${HA_SBIN_DIR}/crm_master -v 100 -l reboot
> > +  ${HA_SBIN_DIR}/crm_master -v 50 -l reboot
> >    sapinstance_start
> >    return $?
> >  }
> > @@ -729,17 +733,38 @@
> >  # sapinstance_monitor_clone
> >  #
> >  sapinstance_monitor_clone() {
> > -  # Check status of potential master first
> > +  # First check with the status function (OS tools) whether there could be something like a SAP instance running;
> > +  # as we do not know here whether we are in master or slave state, we do not want to start our monitoring
> > +  # agents (sapstartsrv) on the wrong host.
> > +
> >    sapinstance_init $OCF_RESKEY_InstanceName
> > -  sapinstance_monitor
> > +  sapinstance_status
> >    rc=$?
> > -  [ $rc -eq $OCF_SUCCESS ] && return $OCF_RUNNING_MASTER
> > -  [ $rc -ne $OCF_NOT_RUNNING ] && return $OCF_FAILED_MASTER
> > -
> > -  # The master isn't running, and there were no errors, try ERS
> > -  sapinstance_init $OCF_RESKEY_ERS_InstanceName
> > -  sapinstance_monitor
> > -  rc=$?
> > +  if [ $rc -eq $OCF_SUCCESS ]; then
> > +    sapinstance_monitor
> > +    rc=$?
> > +    if [ $rc -eq $OCF_SUCCESS ]; then
> > +      ${HA_SBIN_DIR}/crm_master -Q -v 100 -l reboot
> > +      return $OCF_RUNNING_MASTER
> > +    else
> > +      ${HA_SBIN_DIR}/crm_master -v 10 -l reboot     # by nature of the SAP enqueue server we have to make sure
> 
> Shouldn't this be something like '-v -10'? I'm really not
> sure, but if the master failed then this node may not be
> capable of running the master.

No, it should stay positive. This is for the rare case that we do not 
have a slave running (slave failed, only one node active, ...).
In that situation we at least want to try a local restart of the master.
If there is a slave somewhere, it will have a higher value anyway.
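
To illustrate the effect of these preference values (a minimal sketch 
using only the numbers from the patch; the node names and the failure 
scenario are made up):

  # nodeA runs the master (Standalone Enqueue Server) and its enqueue
  # process has just died; nodeB runs a healthy slave (ERS).
  #
  # monitor_clone on nodeA (enqueue failed locally):
  ${HA_SBIN_DIR}/crm_master -v 10 -l reboot     # low, but still positive
  #
  # monitor_clone on nodeB (slave healthy):
  ${HA_SBIN_DIR}/crm_master -Q -v 100 -l reboot
  #
  # Pacemaker promotes the instance with the higher master score, so the
  # slave on nodeB (100) wins over the failed master on nodeA (10).
  # If nodeB is down or does not exist, nodeA's score of 10 is still
  # positive, so a local restart and re-promotion of the master remains
  # possible; a negative value could prevent that.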

> > +                                                    # that we do a failover to the slave (enqueue replication server)
> > +                                                    # in case the enqueue process has failed. We signal this to the
> > +                                                    # cluster by setting our master preference to a lower value than the slave.
> > +      return $OCF_FAILED_MASTER
> > +    fi
> > +  else
> > +    sapinstance_init $OCF_RESKEY_ERS_InstanceName
> > +    sapinstance_status
> > +    rc=$?
> > +    if [ $rc -eq $OCF_SUCCESS ]; then
> > +      sapinstance_monitor
> > +      rc=$?
> > +      if [ $rc -eq $OCF_SUCCESS ]; then
> > +        ${HA_SBIN_DIR}/crm_master -Q -v 100 -l reboot
> > +      fi
> > +    fi
> > +  fi
> 
> I got lost in this monitor function. A bit (hopefully) cleaner
> version attached. Can you please review.

Yes, I reviewed it. It looks fine to me. Please apply your patch 
instead of mine.
Then I can probably do some testing next week with the final version.

> Thanks,
> Dejan

Regards,
Alex

> >    return $rc
> >  }
> > @@ -785,16 +810,25 @@
> > 
> > 
> >  #
> > -# sapinstance_notify: After promotion of one master in the cluster, we make sure that all clones reset thier master
> > -#                     value back to 100. This is because a failed monitor on a master might have degree one clone
> > -#                     instance to score 10.
> > +# sapinstance_notify: Handle master scoring - to make sure a slave becomes the next master
> >  #
> >  sapinstance_notify() {
> >    local n_type="$OCF_RESKEY_CRM_meta_notify_type"
> >    local n_op="$OCF_RESKEY_CRM_meta_notify_operation"
> > 
> >    if [ "${n_type}_${n_op}" = "post_promote" ]; then
> > +    # After promotion of one master in the cluster, we make sure that all clones reset their master
> > +    # value back to 100. This is because a failed monitor on a master might have degraded one clone
> > +    # instance to score 10.
> >      ${HA_SBIN_DIR}/crm_master -v 100 -l reboot
> > +  elif [ "${n_type}_${n_op}" = "pre_demote" ]; then
> > +    # If we are a slave and a demote event is announced, make sure we have the highest wish to become master.
> > +    # That is the case when a slave resource was started after the promote event of an already running master (e.g. the node of the slave was down).
> > +    # We also have to make sure to overrule the globally set resource_stickiness or any fail-count factors => INFINITY
> > +    local n_uname="$OCF_RESKEY_CRM_meta_notify_demote_uname"
> > +    if [ ${n_uname} != ${HOSTNAME} ]; then
> > +      ${HA_SBIN_DIR}/crm_master -v INFINITY -l reboot
> > +    fi
> >    fi
> >  }

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
