Hi Praveen, Please find my comments inline [Minh]
Thanks,
Minh

On 11/02/16 17:10, praveen malviya wrote:
> Hi Minh,
>
> I have marked some doubts with [Praveen]. Could you please answer
> them, it will be helpful in understanding the changes in AMFD and AMFND.
>
> Thanks,
> Praveen
>
> On 20-Jan-16 9:03 AM, Minh Hon Chau wrote:
>> osaf/services/saf/amf/README-HEADLESS | 123 ++++++++++++++++++++++++++++++++++
>> 1 files changed, 123 insertions(+), 0 deletions(-)
>>
>>
>> diff --git a/osaf/services/saf/amf/README-HEADLESS b/osaf/services/saf/amf/README-HEADLESS
>> new file mode 100644
>> --- /dev/null
>> +++ b/osaf/services/saf/amf/README-HEADLESS
>> @@ -0,0 +1,123 @@
>> +#
>> +# -*- OpenSAF -*-
>> +#
>> +# (C) Copyright 2016 The OpenSAF Foundation
>> +#
>> +# This program is distributed in the hope that it will be useful, but
>> +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
>> +# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
>> +# under the GNU Lesser General Public License Version 2.1, February 1999.
>> +# The complete license can be accessed from the following location:
>> +# http://opensource.org/licenses/lgpl-license.php
>> +# See the Copying file included with the OpenSAF distribution for full
>> +# licensing terms.
>> +#
>> +# Author(s): Ericsson AB
>> +#
>> +
>> +GENERAL
>> +-------
>> +
>> +This is a description of how the AMF service handles being headless (SC down)
>> +and recovery (SC up).
>> +
>> +CONFIGURATION
>> +-------------
>> +
>> +AMF reads the "scAbsenceAllowed" attribute to determine if headless mode is
>> +supported. A positive integer indicates the number of seconds AMF will tolerate
>> +being headless, and a zero value indicates the headless feature is disabled.
>> +
>> +Normally, the AMF Node Director (amfnd) restarts a node if there is no active
>> +AMF Director (amfd). If headless support is enabled, the Node Director will
>> +delay the restart for the duration specified in "scAbsenceAllowed".
>> If an SC
>> +recovers during the period, the restart is aborted.
>> +
> [Praveen] What happens to controller amfnds? If there is a problem with
> only the directors on the controller, how does the controller amfnd react to
> it? Is there any difference between the handling for AMFND on
> controllers and on payloads?

[Minh] The use case is to take down the whole SC, not just amfd.
The behaviour of amfnd on a controller is similar to before: we use
leds_set in node_up to indicate whether an amfnd is newly started or a
veteran. From there, the handling of the node_up message in amfd
treats the amfnd differently depending on leds_set.

>> +IMPLEMENTATION DETAILS
>> +----------------------
>> +
>> +* Amfnd detects being headless:
>> +Upon receiving the NCSMDS_DOWN event, which indicates that the last active SC has
>> +gone, amfnd does not reboot the node and enters headless mode (if scAbsenceAllowed
>> +is configured).
>> +
>> +* Escalation and Recovery during headless:
>> +Restart recovery is still carried out in real time as before, while any recovery
>> +that involves failover/switchover is delayed until amfd is up. If a Component or
> [Praveen] For comp restart recovery, failover and switchover are not
> applicable, so which type of switchover or failover is meant?

[Minh] All switchover/failover on comp/su is delayed until the SC comes back.

>> +Su failover happens, the component/Su will only be marked as failed. The repair
>> +action will be initiated when the SC comes back, but it also depends on the desired
>> +configuration. A Node Failover or Switchover will result in a node restart.
>> +
>> +* Amfnd detects the SC coming back from headless:
>> +NCSMDS_UP is the event by which amfnd detects an active amfd's presence after being
>> +headless.
>> +
>> +* Recovery after being headless:
>> +There could be admin ops or recovery actions in progress when the cluster enters
>> +headless.
>> The normal sequence of those actions is left incomplete, and therefore it
>> +will leave the assignments and states of AMF entities inconsistent. New messages
>> +(state information messages) have been introduced to carry those assignments and
>> +states from all amfnd(s) to amfd. Amfd collects all these
>> +messages and recovers/adjusts the assignments and states that are left over from
>> +headless.
>> +
> [Praveen] If AMFND deletes the assignments, as in su-failover,
> node-failover or node-switchover, how will AMFD adjust the
> assignments, since it will not get them from AMFND?

[Minh] In this case AMFD doesn't have to adjust the assignments, since
there are none. Assignments could be removed due to a fault during
headless, and AMFD will try to repair if auto-repair is configured.

>> +State information messages also contain component and SU restart counts; these
>> +new counter values will be updated in IMM after headless.
>> +
>> +The operation in which amfnd(s) send state information messages and amfd processes
>> +these messages is known as a *sync* operation, and is referred to as such in the
>> +implementation.
>> +
>> +Example 1:
>> +Admin si-swap on a 2N SI: The cluster goes headless at the time when SU1's Active
>> +assignment has moved to Quiesced. Amfd will receive state information messages
>> +with one Quiesced (SU1) and one Standby (SU2) assignment. Amfd will send su
>> +si assignment messages to assign SU2 to Active, and SU1 to Standby.
>> +
>> +Example 2:
>> +Su failover on a 2N SU: While the cluster is headless, Active SU1 has a fault that
>> +escalates to an SU failover. SU1's assignment is removed, and SU1 is marked as failed with
>> +operState Disabled. Once the SC comes back, amfd will send an su si assignment
>> +to assign SU2 (being Standby) to Active. Depending on whether AutoRepair is
>> +configured, SU1 will be repaired.
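
For illustration only, Examples 1 and 2 can be traced with a small Python sketch. It is not the amfd implementation: the function name recover_2n, the plain-string HA states and the "failed_sus" set are all hypothetical, and real amfd works on its own SU/SI objects and messages. The sketch only shows the decision the README describes: promote a leftover Standby when no Active was reported, move a leftover Quiesced to Standby, and repair failed SUs only if auto-repair is configured.

```python
from typing import Dict, List, Set


def recover_2n(reported: Dict[str, str], failed_sus: Set[str],
               auto_repair: bool) -> List[str]:
    """Return the actions a director might take for one 2N SI after sync.

    reported   -- HA state per SU as carried by state information messages
                  (hypothetical representation: SU name -> state string)
    failed_sus -- SUs that were marked failed while headless
    """
    actions: List[str] = []
    has_active = "ACTIVE" in reported.values()

    # Promote a leftover Standby if no Active assignment survived.
    for su, state in sorted(reported.items()):
        if state == "STANDBY" and not has_active:
            actions.append(f"assign {su} ACTIVE")
            has_active = True

    # A leftover Quiesced assignment becomes Standby once an Active exists.
    for su, state in sorted(reported.items()):
        if state == "QUIESCED" and has_active:
            actions.append(f"assign {su} STANDBY")

    # Repair is only attempted when auto-repair is configured.
    if auto_repair:
        actions.extend(f"repair {su}" for su in sorted(failed_sus))
    return actions


# Example 1: si-swap interrupted with SU1 Quiesced and SU2 Standby.
example1 = recover_2n({"SU1": "QUIESCED", "SU2": "STANDBY"}, set(), False)

# Example 2: SU1 failed over while headless, only SU2's Standby is reported.
example2 = recover_2n({"SU2": "STANDBY"}, {"SU1"}, True)
```

With these inputs, example1 holds the two assignment actions of Example 1 and example2 holds the promotion of SU2 followed by the repair of SU1.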
>> +
>> +Example 3:
>> +Si dependency: While the cluster is headless, both SU1 and SU2 have faults such that
>> +all assignments of SI1 to those SUs are removed. After receiving state information
>> +messages, if any SIs have SI1 as their sponsor SI, these dependent SI(s) will start
>> +assignment removal.
>> +
>> +* Other services interfaces:
>> +Only Amfd uses the Log and Ntf APIs, and those functions in AMF which require Log/Ntf
>> +are limited during headless.
>> +The Clm functionality (mostly the track callback) that Amfnd uses should work as before;
>> +the only thing Amfnd needs to do is reinitialize its Clm handle when the active SC comes back
>> +from headless.
> [Praveen] Since the CLM handle becomes invalid, AMFND will not get CLM
> track callbacks for controllers. How will amfnd send the node_up
> message to AMFD? Is it the same as before, based on MDS callbacks, and after
> CLM reinitialization?

[Minh] It relies on the MDS callback as before, but AMFND reuses the CLM
*info* from before headless to fill in the node_up message.

>> +Imm Admin ops are not supported during headless since there is no active Amfd.
>> +In general, AMF functions that require Amfd involvement are not supported
>> +during headless. All those functions will work as normal after the SC comes back
>> +from headless.
>> +
>> +LIMITATIONS
>> +-----------
>> +
>> +* Recovery actions are limited while headless.
>> +Failover/Switchover is delayed until SC recovery.
>> +
>> +* Delayed failover recovery is supported for 2N Service Group
>> +Only for the 2N Service Group does delayed failover recovery support most combinations of
>> +assignment states (Quiesced/Quiescing/Standby/Active) which are left over
>> +from headless.
>> +EX: A Standby assignment will transition to Active directly if required;
>> +a Quiesced/Quiescing assignment will be removed if the admin entity is LOCKED,
>> +or transition to Standby, etc...
>> +
>> +For other Service Group types, Standby assignments are first removed, and
>> +reassigned as appropriate for the SG.
>> +
>> +* Tolerant timer of SI dependency.
>> +After recovery from headless, if an unassigned sponsor SI is detected, all its
>> +dependent SI(s) assignments are removed regardless of the tolerance duration. The time
>> +at which the sponsor SI became unassigned is not recorded, so the new amfd cannot
>> +figure out how much time is left that the dependent SI(s) can tolerate.
>> +
>>
>
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel