Please apply this patch on top of patches 1 to 4. Thanks, Praveen
On 08-Mar-16 5:08 PM, praveen malviya wrote:
> Hi,
>
> Attached is the PoC patch that re-uses the existing SG FSM code by resuming
> the SG FSM state after the first controller comes up.
> The patch thereby avoids rebooting a node that is in transition.
>
> More about the patch below:
>
> With this approach the SG FSM was successfully recovered for the QUIESCED and
> QUIESCING state transitions of SUSIs (SU for each SI) in the following
> admin operation cases (without SI deps):
> 1) SU SHUTDOWN and LOCK.
> 2) SI LOCK and SHUTDOWN.
> 3) SG LOCK and SHUTDOWN.
>
> After the SG FSM state was resumed, the SG moved to the correct state once the
> first controller came up and completed the admin operation as it does now in a
> normal cluster. The UNLOCK operation was also successful in all the cases.
>
> In the delayed_failover approach (06-08), the problem was that the HA state
> of the SU for each SI was not considered and each SUSI was assumed to be assigned.
> Because of this, the original state of the SU, and hence the SG FSM, could not be
> resumed.
>
> Approach in this SG FSM recovery patch:
> It recovers each SUSI FSM state and uses this to resume the SG in the
> same FSM state it was in before the controllers went down. Thus it uses
> the original SG FSM code.
>
> Some benefits of this approach:
> 1) The existing SG FSM code can be used.
> 2) It does not require a node reboot in transition state.
> 3) The SG FSM code for each model already handles faults, SI deps and
> all admin operations, so any issue only requires deducing the
> SG FSM state at the time the controllers went down and resuming the SG in the same
> state.
> 4) There are FIVE SG FSM states in our code, of which the STABLE
> state is not applicable to a transition state. So there are only
> FOUR SG FSM states to be resumed.
>
> Note: For testing admin ops, the cluster was freshly started for each lock
> and shutdown operation, as the assignment counter related changes are not done
> in this patch.
>
> Thanks,
> Praveen.
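A toy sketch of the recovery idea described above (deduce the SG FSM state from the recovered SUSI records, then resume in that state) might look like the following. Everything in it is an assumption made for illustration: the message only says there are five SG FSM states with STABLE excluded, so the four other state names, the `Susi` record, the `pending` flag and the mapping rule in `resume_state` are hypothetical and are not taken from the patch or from amfd.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class SgFsm(Enum):
    # Five states, as the message says; only STABLE is named in the mail.
    # The other four names are illustrative placeholders.
    STABLE = "STABLE"
    SG_REALIGN = "SG_REALIGN"
    SU_OPER = "SU_OPER"
    SI_OPER = "SI_OPER"
    SG_ADMIN = "SG_ADMIN"


@dataclass(frozen=True)
class Susi:
    """Hypothetical recovered SU-SI assignment record."""
    su: str
    si: str
    ha_state: str   # e.g. "ACTIVE", "STANDBY", "QUIESCING", "QUIESCED"
    pending: bool   # True if the assignment was still in transition


def resume_state(susis: Iterable[Susi], admin_target: Optional[str]) -> SgFsm:
    """Pick the SG FSM state to resume in (assumed rule, not amfd's logic).

    admin_target is the kind of entity the interrupted admin operation
    was issued on: "SU", "SI", "SG", or None if there was none.
    """
    if not any(s.pending for s in susis):
        # Nothing was in transition, so there is nothing to resume.
        return SgFsm.STABLE
    by_target = {
        "SG": SgFsm.SG_ADMIN,
        "SI": SgFsm.SI_OPER,
        "SU": SgFsm.SU_OPER,
    }
    # A transition without a known admin operation falls back to realign.
    return by_target.get(admin_target, SgFsm.SG_REALIGN)
```

The point of the sketch is only that the decision reduces to a lookup over four non-STABLE states once the per-SUSI state is known; the real deduction would have to follow the SG FSM code of each redundancy model.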
>
> On 04-Mar-16 9:11 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Please see my comments inline, marked [Minh].
>>
>> Thanks,
>> Minh
>>
>> On 04/03/16 00:41, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> The second version of the patches you published handles immediate
>>> escalation only (1 to 4), but it does not perform 'immediate escalation'
>>> during the transient phases.
>> [Minh] The patch version matters, to be sure we share the same
>> view. The latest version is V4 (not V2), which has immediate escalation in
>> amfnd. As I understand it, performing "immediate escalation during transient
>> phases" means "reboot a node that has a transient SUSI", and it was suggested
>> after V4 was published. As far as this is concerned, we agree to push "immediate
>> escalation" (amfnd) to the base patches (#1 to #4) and to separate "delayed
>> failover" (amfd) into another patch. From there, we will review and
>> see whether or not "delayed failover" is necessary.
>>>
>>> So, the concept patch is not for the "delayed failover" approach but for
>>> doing 'immediate escalation' during transient states as well.
>>> The 'immediate escalation' approach becomes **complete** with the
>>> concept patch. Of course, as mentioned before, I will update the
>>> concept patch further.
>>>
>>> Regarding the scanning of SUSIs in the SG: they are scanned only to identify the
>>> active and standby SUs, not to handle the transition state at SUSI
>>> level. After the node is rebooted, the existing node-failover functionality
>>> of the SG FSM will take care of things at SUSI level, including SI deps, for
>>> all red models. In fact, the patch can later be evolved to call the
>>> existing SG FSM code.
>> [Minh] As mentioned in my previous email, I understand the concept patch is
>> still in progress and the issues will be fixed eventually (I would rather say
>> completed). But my question was about the *value* it gives in the end.
>> Many healthy applications will claim a loss of availability after a node
>> reboot caused by a transient SUSI in another (unimportant) one, and
>> a node reboot is unexpected per configuration.
>> I understand that the complexity/maintainability of the AMF code is important
>> for maintainers, but are there any other reasons that support "immediate
>> escalation"? If that is the case, the concept patch seems to sacrifice
>> availability to gain lower complexity and better maintainability of the code. But if we
>> all agree that availability is most important, then
>> complexity/maintainability is just a matter of coding?
>>>
>>> I think in version 1 of the patches I gave comments about SI deps
>>> and delayed failover getting mixed, and about the way SI dependency is
>>> scanned. I never got responses to those comments or to the other
>>> comments on the v1 amfd patch. Those are important comments and need
>>> to be addressed.
>> [Minh] We have received 2 emails with comments on V2 so far, and all of
>> those have been responded to.
>> In V4 we have corrected the patches according to
>> some of your comments.
>>
>> Below are the dates and times at which the responses were sent:
>>
>> Date: Fri, 12 Feb 2016 11:13:03 +1100
>> From: minh chau<minh.c...@dektech.com.au>
>> Subject: Re: [devel] [PATCH 1 of 5] amfd: Add README file for cloud
>> resilience support [#1620] V2
>> To: praveen malviya<praveen.malv...@oracle.com>,
>> hans.nordeb...@ericsson.com,gary....@dektech.com.au,
>> nagendr...@oracle.com
>> Cc:opensaf-devel@lists.sourceforge.net
>>
>>
>> Date: Fri, 26 Feb 2016 14:41:18 +1100
>> From: minh chau<minh.c...@dektech.com.au>
>> Subject: Re: [devel] [PATCH 2 of 5] amfd: Add support for cloud
>> resilience at director [#1620] V2
>> To: praveen malviya<praveen.malv...@oracle.com>,
>> hans.nordeb...@ericsson.com,gary....@dektech.com.au,
>> nagendr...@oracle.com
>> Cc:opensaf-devel@lists.sourceforge.net
>>>
>>> Regarding the approach taken in the delayed_failover() functionality, I do
>>> not know whether this has been explored or not, but it does not use the
>>> existing SG FSM code. Using the existing code would keep it simple too.
>> [Minh] Which SG FSM code do you think should be used? If there is any
>> inappropriate code, can we all go through it and optimize it?
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 03-Mar-16 1:20 PM, minh chau wrote:
>>>> Hi Nagu, Praveen,
>>>>
>>>> I have been trying your patch with the test case below:
>>>> Set up a 2N model, PL4 hosts SU4 (act), PL5 hosts SU5 (stb)
>>>> 1. Issue admin command shutdown SG
>>>> 2. Hang the quiescing csi_set callback
>>>> 3. Stop both SCs
>>>> 4. Stop PL4
>>>> 5. Restart both SCs
>>>>
>>>> I have also seen this error after the SCs come back:
>>>> SC-2 osafamfd[477]: ER avd_ckpt_siass:
>>>> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon
>>>> safSi=AmfDemoTwon,safApp=AmfDemoTwon does not exist
>>>>
>>>> From the trace file, after amfd sends the reboot message to PL5, realign() is
>>>> called.
>>>> Then realign() creates a duplicate SUSI for SU5; this duplicate
>>>> SUSI is not checkpointed at SC-2.
>>>> When PL5 reboots, node_fail() of the 2N SG calls AVD_SU::delete_all_susis() to
>>>> delete all SUSIs of SU5. Now the 2 duplicate SUSIs are deleted and
>>>> checkpointed; the second one causes "ER avd_ckpt_siass: ... does
>>>> not exist".
>>>>
>>>> This error should also happen with lock/shutdown of SG/SU/Node/NodeGroup.
>>>> And the NodeGroup gets stuck in SHUTTING_DOWN.
>>>> I think these kinds of issues will be fixed by you eventually, but
>>>> all of these aside, looking through the concept patch, the
>>>> complexity/maintainability
>>>> is similar to patch #6. Both have to scan through all SU/SI to
>>>> determine transient SUSIs. The difference is the decision to be made: one reboots
>>>> the node, the other adjusts the state. Though it seems rebooting the node
>>>> will lose availability?
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>>
>>>> On 03/03/16 11:32, minh chau wrote:
>>>>> Hi Nagu, Praveen
>>>>>
>>>>> Patches 09 to 14 are fixes for bugs, which you also need
>>>>> on top of patch #4.
>>>>> The problems you reported should not happen if you have them. They apply
>>>>> regardless of whether we *reboot node if transient states* or *adjust
>>>>> transient states* (delayed failover).
>>>>>
>>>>> Patch 09 -> Return TRY_AGAIN for pg track start/stop in headless
>>>>> Patch 10 -> Resend pg information to directors after headless
>>>>> Patch 11 -> There are two fixes in this patch: (11_1) Fix mapping su,
>>>>> and (11_2) fix amfnd coredump given that we allow comp/su failover
>>>>> (patch #5). I split them
>>>>> Patch 12 -> Do not disable healthy SU
>>>>> Patch 13 -> It's for one payload limitation
>>>>> Patch 14 -> It's for transient state at csi level, written on top of
>>>>> patch #6.
>>>>>
>>>>> So you also need patches 09, 10, 11_1, 12, 13 on top of patch #4, and
>>>>> they need to be reviewed and pushed together with #1->#4 as well.
>>>>>
>>>>> The patches #5 #6 #7 #8 take a different view from "immediate
>>>>> escalation" and "reboot node if transient states".
>>>>>
>>>>> We will look at your assignment_recovery.patch.
>>>>>
>>>>> I also attach patch 1620_amfd_adjust_csi_V2.diff, which fixes the
>>>>> issue in TC #27, but it also depends on the conclusion of how to deal with
>>>>> transient states after headless.
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>> On 03/03/16 02:12, Nagendra Kumar wrote:
>>>>>>
>>>>>> #1 I have applied patches #1 to #4 only. With these patches (not having
>>>>>> patch #6), I expected most of the following tests to pass, but
>>>>>> they failed (listed below).
>>>>>>
>>>>>> I could not test other scenarios (including alarms and
>>>>>> notifications) because I haven't applied patch #6. I think there
>>>>>> should be a simple patch replacing patch #6, which handles a transient
>>>>>> state as 'reboot the node' if AMF finds a SUSI in transient state on
>>>>>> that node.
>>>>>>
>>>>>> I am attaching a concept patch (assignment_recovery.patch), which passes
>>>>>> some of the scenarios; we are testing and enhancing it.
>>>>>>
>>>>>> As Praveen has suggested, to keep it simple we need to reboot the node that is
>>>>>> in a transient state.
>>>>>>
>>>>>> This patch reduces complexity and improves maintainability.
>>>>>>
>>>>>> So, ACK for patches #1-#4 along with the attached patch.
>>>>>>
>>>>>> Please note that the attached patch has been created on top of your patch #6,
>>>>>> so please apply #1 to #4, then #6, and then the attached
>>>>>> patch.
>>>>>>
>>>>>> Currently the patch is for the 2N red model. We are working to extend it to the
>>>>>> Nway Act and No red models (and possibly Nway and NpM); we will
>>>>>> publish it tomorrow.
>>>>>>
>>>>>> TC #1:
>>>>>>
>>>>>> Configuration (comp recovery is comp failover, saAmfSutDefSUFailover
>>>>>> as false) and logs (TC 1) are attached in the ticket.
>>>>>>
>>>>>> 1. Start SC-1, PL-3 and PL-4.
>>>>>> SU1 Act on PL-3 and SU2 Standby on SC-2.
>>>>>>
>>>>>> 2. Stop SC-1 and kill demo. It goes for comp failover as configured.
>>>>>> Ideally, the node should reboot.
>>>>>>
>>>>>> 3. Start SC-1. After the cluster timer expires, PL-4 gets the following
>>>>>> error messages:
>>>>>>
>>>>>> Mar 2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition
>>>>>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 :
>>>>>> SI=safSi=AmfDemo,safApp=AmfDemo1
>>>>>>
>>>>>> Mar 2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition
>>>>>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 :
>>>>>> SI=safSi=AmfDemo1,safApp=AmfDemo1
>>>>>>
>>>>>> There is no assignment given for SU1. SU2 has Standby assignments:
>>>>>>
>>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>>>>>
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>>
>>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>
>>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>>>>>
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>>
>>>>>> Other problems:
>>>>>>
>>>>>> a.) Further commands for locking SU1/SU2 fail with an SG
>>>>>> unstable error.
>>>>>>
>>>>>> b.) Immlist of SU2 gives the result below; it prints the Standby
>>>>>> assignment count as 4, which is wrong:
>>>>>>
>>>>>> saAmfSUNumCurrStandbySIs SA_UINT32_T 4 (0x4)
>>>>>>
>>>>>> saAmfSUNumCurrActiveSIs SA_UINT32_T 0 (0x0)
>>>>>>
>>>>>> c.) Even if SC-2 joins and you do a
>>>>>> failover/switchover of SC-1, it is still the same as above.
>>>>>>
>>>>>> TC #2: After execution of TC #1, stop PL-3. In the worst case, the SU2
>>>>>> assignment should change to Act, which is not happening. After
>>>>>> stopping PL-4 as well, the same problems as in TC #1 occur. Logs attached (TC
>>>>>> 2).
>>>>>>
>>>>>> TC #3: After TC #2, start PL-3 and start SC-2.
>>>>>>
>>>>>> SU1 is instantiated, but gets no assignment, and the same
>>>>>> problem as above occurs.
>>>>>> >>>>>> When stop PL-4, SU1 gets assignments, the following >>>>>> logs comes at SC-2: >>>>>> >>>>>> Mar 2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: >>>>>> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1 >>>>>> does not exist >>>>>> >>>>>> Mar 2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: >>>>>> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 >>>>>> safSi=AmfDemo1,safApp=AmfDemo1 does not exist >>>>>> >>>>>> Mar 2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link >>>>>> <1.1.2:eth0-1.1.4:eth0>, peer not responding >>>>>> >>>>>> Mar 2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link >>>>>> <1.1.2:eth0-1.1.4:eth0> on network plane A >>>>>> >>>>>> Mar 2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact >>>>>> with <1.1.4> >>>>>> >>>>>> Start PL-4, SU2 gets Standby assignments and everything works fine >>>>>> after that. >>>>>> >>>>>> TC #4: Similar problems exist in the following test cases: >>>>>> >>>>>> a.)Configuration same as TC #1 except saAmfSutDefSUFailover as true. >>>>>> >>>>>> After killing demo, PL-3 went for reboot. >>>>>> >>>>>> But the problem is the same as shown in TC #1, TC #2 >>>>>> and TC #3. >>>>>> >>>>>> b.) Configuration same as TC #1 except with >>>>>> saAmfCtDefRecoveryOnError as 2 and saAmfCtDefDisableRestart as 1. >>>>>> >>>>>> But the problem is the same as shown in TC #1, TC #2 >>>>>> and TC #3. >>>>>> >>>>>> c.)Configuration same as TC #1 except with saAmfCtDefRecoveryOnError >>>>>> as 2 and saAmfCtDefDisableRestart as 1 and saAmfSutDefSUFailover >>>>>> as 1. >>>>>> >>>>>> After killing demo, PL-3 went for reboot. >>>>>> >>>>>> But the problem is the same as shown in TC #1, TC #2 >>>>>> and TC #3. >>>>>> >>>>>> TC #5: Configuration same as TC #1 except with >>>>>> saAmfCtDefRecoveryOnError as 2. Configuration and logs(TC 5) >>>>>> attached in ticket. >>>>>> >>>>>> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on >>>>>> SC-2. >>>>>> >>>>>> 2. 
Stop SC-1 and kill demo. It goes for comp restart as configured. >>>>>> >>>>>> 3. Start SC-1. After SC-1 comes up and before cluster timer expires, >>>>>> stop PL-3: >>>>>> >>>>>> Even if PL-3 is stopped(see below PL-3 is not available), SU1 is >>>>>> still having Act assignment and SU2 is having Standby assignment: >>>>>> >>>>>> PM_SC-1:/home/nagu/views/staging # amf-state siass >>>>>> >>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1 >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1 >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1 >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1 >>>>>> >>>>>> >>>>>> >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> >>>>>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1) >>>>>> >>>>>> TC #6: After TC #5, start PL-3: >>>>>> >>>>>> SU1 is not given any assignment (may be because it exists in Amfd >>>>>> db): 
>>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 osafamfwd[8318]: Started >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO >>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING >>>>>> => INSTANTIATED >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigning >>>>>> 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to >>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigned >>>>>> 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to >>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO >>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State >>>>>> UNINSTANTIATED => INSTANTIATING >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 opensafd: OpenSAF(5.0.M0 - >>>>>> 7282:4fbffe857512:) services successfully started >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 amf_demo[8337]: >>>>>> 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO >>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State >>>>>> INSTANTIATING => INSTANTIATED >>>>>> >>>>>> Mar 2 14:22:06 PM_PL-3 amf_demo[8337]: HC started with AMF >>>>>> >>>>>> TC #7: After TC #6: >>>>>> >>>>>> Lock SU1: Amfnd of PL-3 throws error: >>>>>> >>>>>> Mar 2 14:23:57 PM_PL-3 osafamfnd[8259]: ER susi_assign_evh: >>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments >>>>>> >>>>>> This is obvious because, Amfnd doesn’t have any assignment. >>>>>> >>>>>> SU1 admin state is locked, but SUSI is being shown on SU1. >>>>>> >>>>>> TC #8: After TC #7: >>>>>> >>>>>> Lock SU1, it throws error: >>>>>> >>>>>> Admin operation is already going on >>>>>> (su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 >>>>>> >>>>>> TC #9: Same as TC #6 except Configure saAmfCtDefRecoveryOnError as >>>>>> Node Switchover/Failover/Failfast. >>>>>> >>>>>> The problem reported in TC #4 exists. 
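The counter mismatch reported in TC #1 (b), where `saAmfSUNumCurrStandbySIs` reads 4 while only two STANDBY SUSIs are listed for SU2, can be cross-checked mechanically. The following is a minimal sketch, assuming the `amf-state siass` output has the `safSISU=...` / `saAmfSISUHAState=...` line layout shown in this thread; the function names and the `SAMPLE` text are illustrative, not part of any OpenSAF tool.

```python
import re
from collections import Counter

# Two SUSI records copied from the TC #1 listing in this thread.
SAMPLE = r"""
safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
    saAmfSISUHAState=STANDBY(2)
    saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
    saAmfSISUHAState=STANDBY(2)
"""


def count_ha_states(siass_text):
    """Return {su_dn: Counter({ha_state: n})} from siass-style text."""
    counts = {}
    current_su = None
    for raw in siass_text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quoting
        if line.startswith("safSISU="):
            dn = line[len("safSISU="):]
            # The SU DN ends at the first comma that is not escaped as "\,".
            su_part = re.split(r"(?<!\\),", dn, maxsplit=1)[0]
            current_su = su_part.replace("\\,", ",")
        elif line.startswith("saAmfSISUHAState=") and current_su is not None:
            state = line.split("=", 1)[1].split("(", 1)[0]
            counts.setdefault(current_su, Counter())[state] += 1
            current_su = None
    return counts


def standby_mismatch(counts, su_dn, reported_standby):
    """Return (listed, reported) if they differ, else None."""
    listed = counts.get(su_dn, Counter())["STANDBY"]
    return None if listed == reported_standby else (listed, reported_standby)
```

With the `SAMPLE` text and the reported value 4 from TC #1, `standby_mismatch(count_ha_states(SAMPLE), "safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1", 4)` flags the SU as inconsistent, which matches the observation that the counter is wrong.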
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Nagu
>>>>>>
>>>>>> > -----Original Message-----
>>>>>> > From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>>>>>> > Sent: 25 February 2016 14:14
>>>>>> > To: hans.nordeb...@ericsson.com; gary....@dektech.com.au; Nagendra
>>>>>> > Kumar; Praveen Malviya; minh.c...@dektech.com.au
>>>>>> > Cc: opensaf-devel@lists.sourceforge.net
>>>>>> > Subject: [PATCH 01 of 15] amfd: Add support for cloud resilience at common
>>>>>> > libs [#1620]
>>>>>>
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel