osaf/services/saf/amf/README_HEADLESS | 151 +++++++++++++++++---------------- 1 files changed, 76 insertions(+), 75 deletions(-)
Rephrase Headless to SC absence, plus documentation for admin continuation diff --git a/osaf/services/saf/amf/README_HEADLESS b/osaf/services/saf/amf/README_SC_ABSENCE rename from osaf/services/saf/amf/README_HEADLESS rename to osaf/services/saf/amf/README_SC_ABSENCE --- a/osaf/services/saf/amf/README_HEADLESS +++ b/osaf/services/saf/amf/README_SC_ABSENCE @@ -18,86 +18,87 @@ GENERAL ------- -This is a description of how the AMF service handles being headless (SC down) -and recovery (SC up). +This is a description of how the AMF service suppports the SC absence feature +which allows payloads to remain running during the absence of both SCs, and +perform recovery after at least one SC comes back. CONFIGURATION ------------- -AMF reads the "scAbsenceAllowed" attribute to determine if headless mode is -enabled. A positive integer indicates the number of seconds AMF will tolerate -being headless, and a zero value indicates the headless feature is disabled. +AMF reads the "scAbsenceAllowed" attribute to determine if SC absence feature +is enabled. A positive integer indicates the number of seconds AMF will +tolerate the absence period of both SCs, and a zero value indicates this +feature is disabled. -Normally, the AMF Node Director (amfnd) will restart a node if there is no active -AMF Director (amfd). If headless support is enabled, the Node Director will -delay the restart for the duration specified in "scAbsenceAllowed". If a SC -recovers during the period, the restart is aborted. +Normally, the AMF Node Director (amfnd) will restart a node if there is no +active AMF Director (amfd). If this feature is enabled, the Node Director will +delay the restart for the duration specified in "scAbsenceAllowed". If a SC +returns during the period, the restart is aborted. IMPLEMENTATION DETAILS ---------------------- -* Amfnd detects being headless: -Upon receiving NCSMDS_DOWN event which indicates the last active SC has -gone, amfnd will not reboot the node and enter headless mode (if saAbsenceAllowed -is configured) +* Amfnd detects absence of SCs: +Upon receiving NCSMDS_DOWN event which indicates the last active SC has gone, +amfnd will not reboot the node and enters SC absence period (if +scAbsenceAllowed is configured) -* Escalation and Recovery during headless: -Restarts will work as normal, but failover or switchover will -result in Node Failfast. - -The repair action will be initiated when a SC returns if +* Escalation and Recovery during SC absence period: +Restarts will work as normal, but failover or switchover will result in Node +Failfast. The repair action will be initiated when a SC returns if saAmfSGAutoRepair is enabled. -* Amfnd detects SC comes back from headless: -NCSMDS_UP is the event that amfnd uses to detect the presence of an active amfd -after being headless. +* Amfnd detects return of SCs: +NCSMDS_UP is the event that amfnd uses to detect the presence of an active amfd. * New sync messages +New messages (state information messages) have been introduced to carry +assignments and states from all amfnd(s), which then are sent to amfd. State +information messages also contain component and SU restart counts. These new +counter values will be updated to IMM after recovery.The operation where +amfnd(s) sends state information messages and amfd processes these messages +is known as a *sync* operation. -New messages (state information messages) have been introduced to carry assignments and -states from all amfnd(s), which then are sent to amfd. +* Admin operation continuation +If an admin operation on an AMF entity is still in progress when the cluster +loses both SCs, the operation will continue when a SC returns. In order to +resume the admin operation, AMF internal states that are used in the admin +operation need to be restored. In a normal cluster state, these states are +*regularly* checkpointed to the standby AMFD so that the standby AMFD can +take over the active role if the active AMFD goes down. Using a similar +approach, new AMF runtime cached attributes are introduced to store the states +in IMM, as another method of restoring these states for the purpose of SC +absence recovery. The new attributes are: +- osafAmfSISUFsmState:SUSI fsm state +- osafAmfSGFsmState:SG fsm state +- osafAmfSGSuOperationList:SU operation list of SG +- osafAmfSUSwitch:SU switch toggle. -State information messages also contain component and SU restart counts. These -new counter values will be updated to IMM after headless recovery. - -The operation where amfnd(s) sends state information messages and amfd processes -these messages is known as a *sync* operation. +Only 2N SG is currently supported for admin operation continuation. LIMITATIONS ----------- -* Recovery actions are limited while headless. - -Failover/Switchover will result in node failfast. - -* No recovery support if a failover, switchover or node failfast occurs during headless state - -If PL is rebooted during headless state, then SI assignments may be improper after headless recovery. - -* No recovery support if an operation or recovery action is in progress while entering headless state - -If an admin operation or recovery action is in progress when the cluster enters -headless state, the normal sequence of these actions could be incomplete and therefore -leave assignments and states of AMF entities in an inappropriate manner. - -Recovery from this is currently *not supported*. +* While both SCs are absent, any failover or switchover escalation will result +in node failfast. The events of node reboot, node power off, and node failfast +will lead to a loss of SI assignments, which are not restored during the SC +absence period. The SI assignments may remain in improper states until a SC +comes back. Recovery of any lost SI assignments during SC absence period is +currently not supported. * SI dependency tolerance timer - -After recovery from headless, if an unassigned sponsor SI is detected, all its -dependent SI(s) assignments are removed regardless of tolerance duration. The time -of sponsor SI becoming unassigned is not recorded, so the new amfd cannot +After a SC comes back, if an unassigned sponsor SI is detected, all its +dependent SI(s) assignments are removed regardless of tolerance duration. The +time of sponsor SI becoming unassigned is not recorded, so the new amfd cannot figure out how much time is left that the dependent SI(s) can tolerate. * Proxy and Proxied components are not yet supported * Alarms and notifications - -During the headless period, notifications will not be sent -as the Director in charge of sending notifications is not available. -For example, if a component fails to instantiate while headless and its -SU becomes disabled, a state change for the SU from ENABLED to DISABLED -will not be sent. +During the SC absence period, notifications will not be sent as the Director in +charge of sending notifications is not available. For example, if a component +fails to instantiate while SC absence stage and its SU becomes disabled, a state +change for the SU from ENABLED to DISABLED will not be sent. List of possible missed notifications ===================================== @@ -106,13 +107,12 @@ SA_AMF_OP_STATE of a SU SA_AMF_HA_STATE of a SI SA_AMF_ASSIGNMENT_STATE of a SI -After the headless period, some redundant alarms and notifications -may be sent from the Director. Initially the Director will think -all PLs are down. But as sync info is received from PLs, alarms -will be cleared or set, and finally reflect the current state of the cluster. -For example, an alarm may initially be raised for an unassigned SI, but -later cleared as the Director learns of the SI assignment on a PL that -remained running. +After the SC absence period, some redundant alarms and notifications may be sent +from the Director. Initially the Director will think all PLs are down. But as +sync info is received from PLs, alarms will be cleared or set, and finally reflect +the current state of the cluster. For example, an alarm may initially be raised +for an unassigned SI, but later cleared as the Director learns of the SI assignment + on a PL that remained running. Redundant notifications ======================= @@ -125,26 +125,27 @@ Redundant alarms ================ An unassigned SI alarm may be raised and then cleared shortly afterwards -Furthermore, some notifications may be slightly misleading. -For example, if a SI becomes PARTIALLY_ASSIGNED from FULLY_ASSIGNED -because a component develops a fault while headless, the SI change notification -may describe the SI going from UNASSIGNED to PARTIALLY_ASSIGNED. This is -because the Director initially does not know about the existence of the SIs assigned -to PLs that remained running. +Furthermore, some notifications may be slightly misleading. For example, if a SI +becomes PARTIALLY_ASSIGNED from FULLY_ASSIGNED because a component develops a fault +while SC absence period, the SI change notification may describe the SI going from +UNASSIGNED to PARTIALLY_ASSIGNED. This is because the Director initially does not +know about the existence of the SIs assigned to PLs that remained running. Limited notifications ===================== -SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED -when it should be SA_AMF_ASSIGNMENT_FULLY_ASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED +SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to +SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED when it should be +SA_AMF_ASSIGNMENT_FULLY_ASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED -* Some AMF API functions will be unavailable while headless - -saAmfProtectionGroupTrack() and saAmfProtectionGroupTrackStop() return SA_AMF_ERROR_TRY_AGAIN during headless +* Some AMF API functions will be unavailable while SC absence period +saAmfProtectionGroupTrack() and saAmfProtectionGroupTrackStop() return +SA_AMF_ERROR_TRY_AGAIN. * One payload limitation -If the cluster cluster is configured with one payload without PBE, IMM will reload -from XML the second time the cluster goes headless. This causes amfd to lose all objects -which were created before headless and data inconsistency will occur between -amfnd and amfd/IMM on the SC. To avoid this inconsistency, the payload will be rebooted. +If the cluster is configured with one payload without PBE, IMM will reload from +XML the second time the cluster experiences the absence of both SCs. This causes +amfd to lose all objects which were created before SC absence and data +inconsistency will occur between amfnd and amfd/IMM on the SC. To avoid this +inconsistency, the payload will be rebooted. ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel