 osaf/services/saf/amf/README_HEADLESS | 172 ++++++++++++++++++++++++++++++++++
 1 files changed, 172 insertions(+), 0 deletions(-)

diff --git a/osaf/services/saf/amf/README_HEADLESS b/osaf/services/saf/amf/README_HEADLESS
new file mode 100644
--- /dev/null
+++ b/osaf/services/saf/amf/README_HEADLESS
@@ -0,0 +1,172 @@
+#
+#      -*- OpenSAF  -*-
+#
+# (C) Copyright 2016 The OpenSAF Foundation
+#
+# This program is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+# under the GNU Lesser General Public License Version 2.1, February 1999.
+# The complete license can be accessed from the following location:
+# http://opensource.org/licenses/lgpl-license.php
+# See the Copying file included with the OpenSAF distribution for full
+# licensing terms.
+#
+# Author(s): Ericsson AB
+#
+
+GENERAL
+-------
+
+This is a description of how the AMF service handles being headless (SC down)
+and recovery (SC up).
+
+CONFIGURATION
+-------------
+
+AMF reads the "scAbsenceAllowed" attribute to determine if headless mode is
+enabled. A positive integer indicates the number of seconds AMF will tolerate
+being headless, and a zero value indicates the headless feature is disabled.
+
+Normally, the AMF Node Director (amfnd) will restart a node if there is no active
+AMF Director (amfd). If headless support is enabled, the Node Director will
+delay the restart for the duration specified in "scAbsenceAllowed". If a SC
+recovers during that period, the restart is aborted.
+
+IMPLEMENTATION DETAILS
+----------------------
+
+* Amfnd detects being headless:
+Upon receiving the NCSMDS_DOWN event, which indicates that the last active SC
+has gone, amfnd does not reboot the node but enters headless mode (if
+scAbsenceAllowed is configured).
+
+* Escalation and Recovery during headless:
+Restarts will work as normal, but failover/switchover will be delayed until amfd
+is up. During the headless period, a component failover will be treated
+as a SU failover to simplify error handling.
+Node Failover or Switchover will result in Node Failfast.
+
+The repair action will be initiated when a SC returns if
+saAmfSGAutoRepair is enabled.
+
+* Amfnd detects SC comes back from headless:
+NCSMDS_UP is the event that amfnd uses to detect the presence of an active amfd
+after being headless.
+
+* Recovery after being headless:
+There could be admin operations or recovery actions in progress when the cluster
+enters headless state. The normal sequence of these actions could be incomplete
+and therefore leave assignments and states of AMF entities inconsistent.
+New messages (state information messages) have been introduced to carry those
+assignments and states from all amfnd(s) to amfd. Amfd collects all these
+messages and will recover/adjust the assignments and states which are left over
+from headless.
+
+State information messages also contain component and SU restart counts. These
+new counter values will be updated to IMM after headless recovery.
+
+The operation where amfnd(s) send state information messages and amfd processes
+these messages is known as a *sync* operation.
+
+Example 1:
+Admin si-swap of a 2N SI: The cluster goes headless when SU1, which has the Active
+assignment, moves to Quiesced. Amfd will receive a state information message
+with one Quiesced (SU1) and one Standby (SU2) assignment. Amfd will send a SU-SI
+assignment message to assign SU2 to Active, and SU1 to Standby.
+
+Example 2:
+SU failover on 2N SU: While headless, Active SU1 becomes faulty and
+escalates to a SU failover. SU1's assignment is removed and marked as failed, and
+SU1's operState is set to Disabled. Once a SC comes back, amfd will send a SU-SI
+assignment to assign SU2 (being Standby) to Active. If AutoRepair is configured,
+SU1 will be repaired.
+
+Example 3:
+SI dependency: While headless, both SU1 and SU2 become faulty and
+all assignments of SI1 to those SUs are removed.
+After amfd receives the state information
+messages, if any SIs have SI1 as their sponsor SI, these dependent SI(s) will
+start assignment removal.
+
+LIMITATIONS
+-----------
+
+* Recovery actions are limited while headless.
+
+Failover/Switchover is delayed until SC recovery. The saAmfSUFailover setting
+will be ignored and treated as if set to 1.
+
+* Only 2N/NoRed/NwayAct Service Groups are supported
+
+For these SG types, delayed failover recovery will support most combinations
+of assignment states (Quiesced/Quiescing/Standby/Active) left over
+from headless.
+
+Example: A Standby assignment will transition to Active directly if required;
+a Quiesced/Quiescing assignment will be removed if the admin entity is LOCKED,
+or otherwise transition to Standby.
+
+* SI dependency tolerance timer
+After recovery from headless, if an unassigned sponsor SI is detected, the
+assignments of all its dependent SI(s) are removed regardless of tolerance duration.
+The time at which the sponsor SI became unassigned is not recorded, so the new amfd
+cannot determine how much tolerance time the dependent SI(s) have left.
+
+* Proxy and Proxied components are not yet supported
+
+* Alarms and notifications
+
+During the headless period, notifications will not be sent
+as the Director in charge of sending notifications is not available.
+For example, if a component fails to instantiate while headless and its
+SU becomes disabled, a state change for the SU from ENABLED to DISABLED
+will not be sent.
+
+List of possible missed notifications
+=====================================
+SA_AMF_PRESENCE_STATE of a SU
+SA_AMF_OP_STATE of a SU
+SA_AMF_HA_STATE of a SI
+SA_AMF_ASSIGNMENT_STATE of a SI
+
+After the headless period, some redundant alarms and notifications
+may be sent from the Director. Initially the Director will think
+all PLs are down. But as sync info is received from PLs, alarms
+will be cleared or set, and will finally reflect the current state of the cluster.
+For example, an alarm may initially be raised for an unassigned SI, but
+later cleared as the Director learns of the SI assignment on a PL that
+remained running.
+
+Redundant notifications
+=======================
+SA_AMF_PRESENCE_STATE of a SU may change from SA_AMF_PRESENCE_UNINSTANTIATED to <<current state>>
+SA_AMF_OP_STATE of a SU may change from SA_AMF_OPERATIONAL_DISABLED to <<current state>>
+SA_AMF_HA_STATE of a SI may change from "" to <<current state>>
+SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to <<current state>>
+
+Redundant alarms
+================
+An unassigned SI alarm may be raised and then cleared shortly afterwards
+
+Furthermore, some notifications may be slightly misleading.
+For example, if a SI becomes PARTIALLY_ASSIGNED from FULLY_ASSIGNED
+because a component develops a fault while headless, the SI change notification
+may describe the SI going from UNASSIGNED to PARTIALLY_ASSIGNED. This is
+because the Director initially does not know about the existence of the SIs assigned
+to PLs that remained running.
+
+Limited notifications
+=====================
+SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED
+when it should be SA_AMF_ASSIGNMENT_FULLY_ASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED
+
+* Some AMF API functions will be unavailable while headless
+
+saAmfProtectionGroupTrack() and saAmfProtectionGroupTrackStop() return SA_AIS_ERR_TRY_AGAIN while headless
+
+* One payload limitation
+
+If the cluster is configured with one payload and without PBE, IMM will reload from xml the second time the
+cluster goes headless. This causes amfd to lose all objects that were created before headless, and data becomes
+inconsistent between amfnd and amfd/IMM. To avoid this inconsistency, the payload needs a reboot.
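
Regarding the API limitation above: a client that calls saAmfProtectionGroupTrack() while headless will see the try-again return code (SA_AIS_ERR_TRY_AGAIN in saAis.h) until a SC returns. The following is only a minimal sketch of a bounded retry loop a client might use, not code from this patch. The names retry_while_headless, amf_call_fn, wait_fn and fake_track_call are hypothetical, and RC_OK / RC_TRY_AGAIN are stand-ins assumed to mirror the values of SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN so that the sketch compiles without the OpenSAF headers.

```c
#include <stddef.h>

/* Hypothetical stand-ins, assumed to mirror SA_AIS_OK (1) and
 * SA_AIS_ERR_TRY_AGAIN (6); a real client would use SaAisErrorT. */
enum { RC_OK = 1, RC_TRY_AGAIN = 6 };

/* Hypothetical callback wrapping one AMF API call, for example a call to
 * saAmfProtectionGroupTrack() with arguments stored behind ctx. */
typedef int (*amf_call_fn)(void *ctx);

/* Hypothetical hook used to pause between attempts; may be NULL. */
typedef void (*wait_fn)(void);

/* Call fn until it returns something other than RC_TRY_AGAIN or until
 * max_attempts calls have been made. Returns the last return code, or
 * RC_TRY_AGAIN if max_attempts is not positive. */
static int retry_while_headless(amf_call_fn fn, void *ctx,
                                int max_attempts, wait_fn wait)
{
    int rc = RC_TRY_AGAIN;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        rc = fn(ctx);
        if (rc != RC_TRY_AGAIN)
            break;                      /* success or a hard error */
        if (wait != NULL && attempt < max_attempts)
            wait();                     /* back off before the next try */
    }
    return rc;
}

/* Stub simulating a headless period: returns RC_TRY_AGAIN until the
 * counter behind ctx reaches zero, then RC_OK. */
static int fake_track_call(void *ctx)
{
    int *headless_calls_left = ctx;
    if (*headless_calls_left > 0) {
        (*headless_calls_left)--;
        return RC_TRY_AGAIN;
    }
    return RC_OK;
}
```

With the stub, retry_while_headless(fake_track_call, &n, 5, NULL) succeeds when n starts at 3, and gives up with RC_TRY_AGAIN when the simulated headless period outlasts the attempt budget. In a real client the wait hook would sleep between attempts, and the attempt budget could be chosen with the configured scAbsenceAllowed duration in mind.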
+
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel