Re: [devel] [PATCH 2 of 2] AMF: Update README for SC Absence feature [#2033]

praveen malviya Fri, 23 Sep 2016 01:38:09 -0700

Ack from me for README and PR doc.

Thanks,
Praveen

On 22-Sep-16 10:50 AM, Minh Hon Chau wrote:
>  osaf/services/saf/amf/README_HEADLESS |  151 +++++++++++++++++----------------
>  1 files changed, 76 insertions(+), 75 deletions(-)
>
>
> Rephrase Headless to SC absence, plus documentation for
> admin continuation
>
> diff --git a/osaf/services/saf/amf/README_HEADLESS b/osaf/services/saf/amf/README_SC_ABSENCE
> rename from osaf/services/saf/amf/README_HEADLESS
> rename to osaf/services/saf/amf/README_SC_ABSENCE
> --- a/osaf/services/saf/amf/README_HEADLESS
> +++ b/osaf/services/saf/amf/README_SC_ABSENCE
> @@ -18,86 +18,87 @@
>  GENERAL
>  -------
>
> -This is a description of how the AMF service handles being headless (SC down)
> -and recovery (SC up).
> +This is a description of how the AMF service supports the SC absence feature
> +which allows payloads to remain running during the absence of both SCs, and
> +perform recovery after at least one SC comes back.
>
>  CONFIGURATION
>  -------------
>
> -AMF reads the "scAbsenceAllowed" attribute to determine if headless mode is
> -enabled. A positive integer indicates the number of seconds AMF will tolerate
> -being headless, and a zero value indicates the headless feature is disabled.
> +AMF reads the "scAbsenceAllowed" attribute to determine if the SC absence
> +feature is enabled. A positive integer indicates the number of seconds AMF
> +will tolerate the absence of both SCs, and a zero value indicates this
> +feature is disabled.
>
> -Normally, the AMF Node Director (amfnd) will restart a node if there is no active
> -AMF Director (amfd). If headless support is enabled, the Node Director will
> -delay the restart for the duration specified in "scAbsenceAllowed". If a SC
> -recovers during the period, the restart is aborted.
> +Normally, the AMF Node Director (amfnd) will restart a node if there is no
> +active AMF Director (amfd). If this feature is enabled, the Node Director will
> +delay the restart for the duration specified in "scAbsenceAllowed". If a SC
> +returns during the period, the restart is aborted.
>
>  IMPLEMENTATION DETAILS
>  ----------------------
>
> -* Amfnd detects being headless:
> -Upon receiving NCSMDS_DOWN event which indicates the last active SC has
> -gone, amfnd will not reboot the node and enter headless mode (if saAbsenceAllowed
> -is configured)
> +* Amfnd detects absence of SCs:
> +Upon receiving the NCSMDS_DOWN event, which indicates the last active SC has
> +gone, amfnd does not reboot the node and instead enters the SC absence period
> +(if scAbsenceAllowed is configured).
>
> -* Escalation and Recovery during headless:
> -Restarts will work as normal, but failover or switchover will
> -result in Node Failfast.
> -
> -The repair action will be initiated when a SC returns if
> +* Escalation and Recovery during SC absence period:
> +Restarts will work as normal, but failover or switchover will result in Node
> +Failfast. The repair action will be initiated when a SC returns if
>  saAmfSGAutoRepair is enabled.
>
> -* Amfnd detects SC comes back from headless:
> -NCSMDS_UP is the event that amfnd uses to detect the presence of an active amfd
> -after being headless.
> +* Amfnd detects return of SCs:
> +NCSMDS_UP is the event that amfnd uses to detect the presence of an active amfd.
>
>  * New sync messages
> +New messages (state information messages) have been introduced to carry
> +assignments and states from all amfnd(s), which then are sent to amfd. State
> +information messages also contain component and SU restart counts. These new
> +counter values will be updated to IMM after recovery. The operation where
> +amfnd(s) sends state information messages and amfd processes these messages
> +is known as a *sync* operation.
>
> -New messages (state information messages) have been introduced to carry assignments and
> -states from all amfnd(s), which then are sent to amfd.
> +* Admin operation continuation
> +If an admin operation on an AMF entity is still in progress when the cluster
> +loses both SCs, the operation will continue when a SC returns. In order to
> +resume the admin operation, AMF internal states that are used in the admin
> +operation need to be restored. In a normal cluster state, these states are
> +*regularly* checkpointed to the standby AMFD so that the standby AMFD can
> +take over the active role if the active AMFD goes down. Using a similar
> +approach, new AMF runtime cached attributes are introduced to store the states
> +in IMM, as another method of restoring these states for the purpose of SC
> +absence recovery. The new attributes are:
> +- osafAmfSISUFsmState: SUSI fsm state
> +- osafAmfSGFsmState: SG fsm state
> +- osafAmfSGSuOperationList: SU operation list of SG
> +- osafAmfSUSwitch: SU switch toggle
>
> -State information messages also contain component and SU restart counts. These
> -new counter values will be updated to IMM after headless recovery.
> -
> -The operation where amfnd(s) sends state information messages and amfd processes
> -these messages is known as a *sync* operation.
> +Only 2N SG is currently supported for admin operation continuation.
>
>  LIMITATIONS
>  -----------
>
> -* Recovery actions are limited while headless.
> -
> -Failover/Switchover will result in node failfast.
> -
> -* No recovery support if a failover, switchover or node failfast occurs during headless state
> -
> -If PL is rebooted during headless state, then SI assignments may be improper after headless recovery.
> -
> -* No recovery support if an operation or recovery action is in progress while entering headless state
> -
> -If an admin operation or recovery action is in progress when the cluster enters
> -headless state, the normal sequence of these actions could be incomplete and therefore
> -leave assignments and states of AMF entities in an inappropriate manner.
> -
> -Recovery from this is currently *not supported*.
> +* While both SCs are absent, any failover or switchover escalation will result
> +in node failfast. The events of node reboot, node power off, and node failfast
> +will lead to a loss of SI assignments, which are not restored during the SC
> +absence period. The SI assignments may remain in improper states until a SC
> +comes back. Recovery of any lost SI assignments during SC absence period is
> +currently not supported.
>
>  * SI dependency tolerance timer
> -
> -After recovery from headless, if an unassigned sponsor SI is detected, all its
> -dependent SI(s) assignments are removed regardless of tolerance duration. The time
> -of sponsor SI becoming unassigned is not recorded, so the new amfd cannot
> +After a SC comes back, if an unassigned sponsor SI is detected, all its
> +dependent SI(s) assignments are removed regardless of tolerance duration. The
> +time of sponsor SI becoming unassigned is not recorded, so the new amfd cannot
>  figure out how much time is left that the dependent SI(s) can tolerate.
>
>  * Proxy and Proxied components are not yet supported
>
>  * Alarms and notifications
> -
> -During the headless period, notifications will not be sent
> -as the Director in charge of sending notifications is not available.
> -For example, if a component fails to instantiate while headless and its
> -SU becomes disabled, a state change for the SU from ENABLED to DISABLED
> -will not be sent.
> +During the SC absence period, notifications will not be sent as the Director in
> +charge of sending notifications is not available. For example, if a component
> +fails to instantiate during the SC absence period and its SU becomes disabled,
> +a state change for the SU from ENABLED to DISABLED will not be sent.
>
>  List of possible missed notifications
>  =====================================
> @@ -106,13 +107,12 @@ SA_AMF_OP_STATE of a SU
>  SA_AMF_HA_STATE of a SI
>  SA_AMF_ASSIGNMENT_STATE of a SI
>
> -After the headless period, some redundant alarms and notifications
> -may be sent from the Director. Initially the Director will think
> -all PLs are down. But as sync info is received from PLs, alarms
> -will be cleared or set, and finally reflect the current state of the cluster.
> -For example, an alarm may initially be raised for an unassigned SI, but
> -later cleared as the Director learns of the SI assignment on a PL that
> -remained running.
> +After the SC absence period, some redundant alarms and notifications may be sent
> +from the Director. Initially the Director will think all PLs are down. But as
> +sync info is received from PLs, alarms will be cleared or set, and finally reflect
> +the current state of the cluster. For example, an alarm may initially be raised
> +for an unassigned SI, but later cleared as the Director learns of the SI assignment
> +on a PL that remained running.
>
>  Redundant notifications
>  =======================
> @@ -125,26 +125,27 @@ Redundant alarms
>  ================
>  An unassigned SI alarm may be raised and then cleared shortly afterwards
>
> -Furthermore, some notifications may be slightly misleading.
> -For example, if a SI becomes PARTIALLY_ASSIGNED from FULLY_ASSIGNED
> -because a component develops a fault while headless, the SI change notification
> -may describe the SI going from UNASSIGNED to PARTIALLY_ASSIGNED. This is
> -because the Director initially does not know about the existence of the SIs assigned
> -to PLs that remained running.
> +Furthermore, some notifications may be slightly misleading. For example, if a SI
> +becomes PARTIALLY_ASSIGNED from FULLY_ASSIGNED because a component develops a fault
> +during the SC absence period, the SI change notification may describe the SI going
> +from UNASSIGNED to PARTIALLY_ASSIGNED. This is because the Director initially does
> +not know about the existence of the SIs assigned to PLs that remained running.
>
>  Limited notifications
>  =====================
> -SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED
> -when it should be SA_AMF_ASSIGNMENT_FULLY_ASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED
> +SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to
> +SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED when it should be
> +SA_AMF_ASSIGNMENT_FULLY_ASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED
>
> -* Some AMF API functions will be unavailable while headless
> -
> -saAmfProtectionGroupTrack() and saAmfProtectionGroupTrackStop() return SA_AMF_ERROR_TRY_AGAIN during headless
> +* Some AMF API functions will be unavailable during the SC absence period
> +saAmfProtectionGroupTrack() and saAmfProtectionGroupTrackStop() return
> +SA_AIS_ERR_TRY_AGAIN.
>
>  * One payload limitation
>
> -If the cluster cluster is configured with one payload without PBE, IMM will reload
> -from XML the second time the cluster goes headless. This causes amfd to lose all objects
> -which were created before headless and data inconsistency will occur between
> -amfnd and amfd/IMM on the SC. To avoid this inconsistency, the payload will be rebooted.
> +If the cluster is configured with one payload without PBE, IMM will reload from
> +XML the second time the cluster experiences the absence of both SCs. This causes
> +amfd to lose all objects which were created before SC absence and data
> +inconsistency will occur between amfnd and amfd/IMM on the SC. To avoid this
> +inconsistency, the payload will be rebooted.
>
>
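
For readers of the CONFIGURATION section in the patch above, the restart-delay
rule can be sketched as a small state model. This is only an illustration of
the semantics the README describes (zero disables the feature, a positive
value is a tolerance in seconds, and a returning SC aborts the pending
restart). The class name, method names and returned strings are hypothetical
and are not taken from the amfnd source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScAbsenceTimer:
    """Hypothetical model of the restart delay described in the README."""

    sc_absence_allowed: int  # seconds; 0 means the feature is disabled
    absent_since: Optional[float] = None  # time the last active SC went away

    def on_sc_down(self, now: float) -> str:
        # Feature disabled: the node restarts as it would without SC absence.
        if self.sc_absence_allowed == 0:
            return "restart_now"
        # Feature enabled: remember when the absence started and wait.
        self.absent_since = now
        return "delay_restart"

    def on_sc_up(self, now: float) -> str:
        # An SC returned; any pending restart is aborted.
        if self.absent_since is None:
            return "idle"
        self.absent_since = None
        return "restart_aborted"

    def on_tick(self, now: float) -> str:
        # Periodic check: restart once the tolerated duration has elapsed.
        if self.absent_since is None:
            return "idle"
        if now - self.absent_since >= self.sc_absence_allowed:
            return "restart_now"
        return "waiting"
```

With a tolerance of 10 seconds, an SC that returns after 6 seconds aborts the
restart, while an absence that reaches 10 seconds leads to the restart.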

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
