Hi Minh,

I have marked some doubts with [Praveen]. Could you please answer them? It will be helpful in understanding the changes in AMFD and AMFND.

Thanks,
Praveen

On 20-Jan-16 9:03 AM, Minh Hon Chau wrote:
> osaf/services/saf/amf/README-HEADLESS | 123 ++++++++++++++++++++++++++++++++++
> 1 files changed, 123 insertions(+), 0 deletions(-)
>
>
> diff --git a/osaf/services/saf/amf/README-HEADLESS b/osaf/services/saf/amf/README-HEADLESS
> new file mode 100644
> --- /dev/null
> +++ b/osaf/services/saf/amf/README-HEADLESS
> @@ -0,0 +1,123 @@
> +#
> +# -*- OpenSAF -*-
> +#
> +# (C) Copyright 2016 The OpenSAF Foundation
> +#
> +# This program is distributed in the hope that it will be useful, but
> +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> +# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
> +# under the GNU Lesser General Public License Version 2.1, February 1999.
> +# The complete license can be accessed from the following location:
> +# http://opensource.org/licenses/lgpl-license.php
> +# See the Copying file included with the OpenSAF distribution for full
> +# licensing terms.
> +#
> +# Author(s): Ericsson AB
> +#
> +
> +GENERAL
> +-------
> +
> +This is a description of how the AMF service handles being headless (SC down)
> +and recovery (SC up).
> +
> +CONFIGURATION
> +-------------
> +
> +AMF reads the "scAbsenceAllowed" attribute to determine if headless mode is
> +supported. A positive integer indicates the number of seconds AMF will tolerate
> +being headless, and a zero value indicates the headless feature is disabled.
> +
> +Normally, the AMF Node Director (amfnd) restarts a node if there is no active
> +AMF Director (amfd). If headless support is enabled, the Node Director will
> +delay the restart for the duration specified in "scAbsenceAllowed". If a SC
> +recovers during the period, the restart is aborted.
> +

[Praveen] What happens to the controller amfnds? If there is a problem with only the directors on the controllers, how does the controller amfnd react to it?
Is there any difference between the handling of AMFND on controllers and on payloads?

> +IMPLEMENTATION DETAILS
> +----------------------
> +
> +* Amfnd detects being headless:
> +Up on receiving NCSMDS_DOWN event which indicates the last active SC has
> +gone, amfnd does not reboot node and entering headless mode (if saAbsenceAllowed
> +configured)
> +
> +* Escalation and Recovery during headless:
> +The Restart faulty is still real-time recovered as before, while only recovery
> +has failover/switchover involved will be delayed until amfd up.

[Praveen] For comp restart recovery, failover and switchover are not applicable, so which type of switchover or failover is meant here?

> If Component or
> +Su failover happens, the component/Su will marked as failed only. The repair
> +action will be initiated when SC comes back but it also depends on desired
> +configurations. The Node Failover or Switchover will result in a node restart.
> +
> +* Amfnd detects SC comes back from headless:
> +NCSMDS_UP is the event that amfnd detects active amfd's presence after being
> +headless.
> +
> +* Recovery after being headless:
> +There could be admin ops or recovery actions in progress while cluster enters
> +headless. The normal sequence of those actions are uncompleted and therefore it
> +will leave assignments and states of AMF entities inappropriate. The new messages
> +(state information messages) have been introduced to carry those assignments and
> +states from all amfnd(s), which then are sent to amfd. Amfd collects all these
> +messages and recover/adjust the assignments and states which are left over from
> +headless.
> +

[Praveen] If AMFND deletes the assignments, as in su-failover, node-failover or node-switchover, how will AMFD adjust the assignments, since it will not get them from AMFND?

> +State information messages also contain component and SU restart counts, these
> +new counter value will be updated to IMM after headless.
> +
> +The operation that amfnd(s) send state information messages and amfd processes
> +these messages, is known as a *sync* operation, and has been refered in
> +implementation.
> +
> +Example 1:
> +Admin si-swap an 2N SI: Cluster goes headless at the time which SU1 has Active
> +assignment moves to Quiesced. Amfd will receives state information message
> +with one Quiesced (SU1) and one Standby (SU2) assignment. Amfd will send su
> +si assignment message to assign SU2 to Active, and SU1's to Standby.
> +
> +Example 2:
> +Su failover on 2N SU: During headless cluster, Active SU1 has faulty that
> +escalates to a SU failover. SU1's assignment is removed, marked as failed,
> +operState as Disabled. Once SC comes back, amfd will send su si assignment
> +to assign SU2 (being Standby) to Active. Depends on AutoRepair is whether
> +configured, the SU1 will be repaired.
> +
> +Example 3:
> +Si dependency: During headless cluster, both SU1 and SU2 have faulty that
> +all assignment of SI1 to those SUs are removed. After receiving state information
> +messages, if any SI have SI1 as sponsored SI, these dependent SI(s) will start
> +assignment removal.
> +
> +* Other services interfaces:
> +Only Amfd uses Log, Ntf api, and those functions in AMF which require Log/Ntf
> +are limited during headless.
> +The Clm functionalities (mostly track cb) Amfnd uses should work as before,
> +only thing Amfnd needs is to reinitialize Clm handle when active SC comes back
> +from headless.

[Praveen] Since the CLM handle becomes invalid, AMFND will not get CLM track callbacks for the controllers. How will amfnd send the node_up message to AMFD? Is it the same, i.e. based on MDS callbacks and after CLM reinitialization?

> +Imm Admin ops are not supported during headless since no active Amfd.
> +In general, the AMF functions that requires Amfd involvement is not supported
> +during headless. And all those functions will work as normal after SC comes back
> +from headless.
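
To make the questions above easier to discuss, here is a minimal sketch of the restart delay as the README describes it in CONFIGURATION and in the two "Amfnd detects ..." items. It is only a model of the described behaviour, not amfnd code: the class name `HeadlessTimer`, its methods and the returned labels ("headless", "restart", "sync", "normal") are hypothetical, and the only names taken from the README are scAbsenceAllowed, NCSMDS_DOWN and NCSMDS_UP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeadlessTimer:
    """Hypothetical model of the amfnd restart delay; not OpenSAF code."""

    sc_absence_allowed: int            # seconds; 0 means headless is disabled
    deadline: Optional[float] = None   # time at which the node must restart

    def on_sc_down(self, now: float) -> str:
        """Last active SC is gone (NCSMDS_DOWN in the README)."""
        if self.sc_absence_allowed <= 0:
            return "restart"           # feature disabled: restart as before
        self.deadline = now + self.sc_absence_allowed
        return "headless"

    def on_sc_up(self) -> str:
        """An active amfd is visible again (NCSMDS_UP in the README)."""
        self.deadline = None           # the pending restart is aborted
        return "sync"                  # state information is sent to amfd

    def on_tick(self, now: float) -> str:
        """Periodic check; restarts the node once the allowed time is used up."""
        if self.deadline is None:
            return "normal"
        if now >= self.deadline:
            self.deadline = None
            return "restart"
        return "headless"


# Usage: an SC outage that ends before scAbsenceAllowed expires.
timer = HeadlessTimer(sc_absence_allowed=900)
timer.on_sc_down(now=0.0)    # node enters headless mode
timer.on_tick(now=600.0)     # still inside the allowed window
timer.on_sc_up()             # SC recovered, restart aborted
```

The sketch deliberately makes no distinction between controller and payload nodes, since the README does not say whether the two are handled differently.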
> +
> +LIMITATIONS
> +-----------
> +
> +* Recovery actions are limited while headless.
> +Failover/Switchover is delayed until SC recovery.
> +
> +* Delayed failover recovery is supported for 2N Service Group
> +Only for 2N Service Group, delayed failover recovery supports most of combination
> +assignment states (Quiesced/Quiescing/Standby/Active) which are left over
> +from headless.
> +EX: A Standby assignment will transition to Active directly if required,
> +a Quiesced/Quiescing assignment will be removed if admin entity is LOCKED,
> +or transition to Standby, etc...
> +
> +For other Service Group types, Standby assignments are first removed, and
> +reassigned as appropriate for the SG.
> +
> +* Tolerant timer of SI dependency.
> +After recovery from headless, if unassigned sponsor SI is detected, all its
> +dependent SI(s) assignment are removed regardless tolerant duration. The time
> +of sponsor SI becoming unassigned is not recorded so that the new amfd can
> +figure how much time is left that the dependent SI(s) can tolerate.
> +
>

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel