Re: [devel] Proof Of Concept patch reusing SG FSM code for better handling of transient nodes during headless state(was Re: [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620])

minh chau Wed, 09 Mar 2016 21:14:36 -0800

Hi Praveen

Thanks for the PoC patch. I have been reading it, and here is my
understanding; please correct me if I am wrong.
In general, the patch tries to pretend there is no headless gap: the
operations in progress before headless resume after the SC comes back.
To achieve this, the node director now has to report more information
about the HA/assigning state so that the director can resume the SG FSM
state.


The PoC patch seems to add more code for SI than for the others, so I
tried to play with it a bit.
Below are my initial tests of 3 favorite test cases, with findings:

1- Set up a 2N app (act SU on PL4, stb SU on PL5). Stop SCs, stop PL4.
Restart SC-1.
     I got this error:
2016-03-10 13:11:35 PL-5 osafamfnd[418]: CR SU-SI record addition 
failed, SU= safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon : 
SI=safSi=AmfDemoTwon,safApp=AmfDemoTwon

-> There is no uncompleted admin op, so the SG FSM state is set to REALIGN
and realign() is called accordingly. At the moment, realign() is not
able to bring the remaining STANDBY to ACTIVE (unless it is modified).

2- Set up a 2N app (as in 1-). Lock SI, delay csi_set callback for
quiesced at "Assigning" state. Stop SCs, release csi_set cb at "Assigned"
state. Restart SC-1.
    I got 2 SUs: 1 STANDBY, 1 QUIESCED

3- Set up a 2N app (as in 1-). Lock SI, delay csi_set callback for
quiesced at "Assigning" state. Stop SCs, restart SC-1, now release
csi_set cb at "Assigned" state.
    I got 2 SUs: 1 STANDBY, 1 QUIESCED

-> I think 2- and 3- have the same root cause: after the SG FSM state is
set to SI_OPER, the corresponding SG FSM code that should be called is
si_admin_down(). I have tried to get it called in resume_sg_fsm(), but it
does not work: it requires @admin_si to be set, and @admin_si needs to be
cleared at the end.

I have a concern about the SG FSM code for admin operations. All of it is
normally entered from the top of the sequence, which originates from the
IMM admin callback. Now this SG FSM code is called in a way it was not
designed for. I suspect there will be (many) changes in the SG FSM code
to get it to work after headless.
Another thing: an uncompleted admin op could be left over from headless,
but there could also be a node reboot due to an error on another node
during headless. In such cases, I wonder whether the current SG FSM code
can handle this or whether it could get stuck somewhere down the track.

I think the PoC patch is at a very early stage, and at this moment I
cannot tell whether the approach works until it is taken to the end of
the road.
I suggest testing the completed PoC patch for 2N as below:

- For each @entity in SI/SU/SG/node/nodegroup
     - For each @admin supported for this @entity
          - Issue @admin command
          - For each @callback for ACTIVE/STANDBY/QUIESCED/QUIESCING
            received at component
              - For each @delay of Assigning, Assigned in @callback
                 - Test 1: Stop SC, release @delay, start SC
                 - Test 2: Stop SC, release @delay, stop PL, start SC
                 - Test 3: Stop SC, start SC, release @delay
                 - Test 4: Stop SC, stop PL, start SC, release @delay
                 Check if after headless, amf-state looks right

So the test (I hope) will scan through all the SG FSM code of 2N.
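
The nested loops above could be enumerated with a small script, so that no combination is skipped. This is only a sketch: the admin op list ("lock", "shutdown") is an assumption taken from the operations mentioned in this thread and is applied to every entity, and the step strings are plain labels, not real commands.

```python
from itertools import product

ENTITIES = ("SI", "SU", "SG", "node", "nodegroup")
# Assumption: the same two admin ops are tried on every entity.
ADMIN_OPS = ("lock", "shutdown")
HA_CALLBACKS = ("ACTIVE", "STANDBY", "QUIESCED", "QUIESCING")
DELAY_POINTS = ("Assigning", "Assigned")
# The four headless sequences from the list above, as ordered step labels.
SEQUENCES = {
    "Test 1": ("stop SC", "release delay", "start SC"),
    "Test 2": ("stop SC", "release delay", "stop PL", "start SC"),
    "Test 3": ("stop SC", "start SC", "release delay"),
    "Test 4": ("stop SC", "stop PL", "start SC", "release delay"),
}


def test_matrix():
    """Yield one dict per combination of entity, admin op, callback, delay, sequence."""
    combos = product(ENTITIES, ADMIN_OPS, HA_CALLBACKS, DELAY_POINTS, SEQUENCES.items())
    for entity, admin, callback, delay, (name, steps) in combos:
        yield {
            "entity": entity,
            "admin": admin,
            "callback": callback,
            "delay": delay,
            "test": name,
            "steps": steps,
        }


cases = list(test_matrix())
print(len(cases), "cases; first:", cases[0])
```

Each generated case still has to be run by hand or by a test harness, and amf-state checked after headless; the script only keeps track of which combinations exist.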

One minor clarification on the delayed_failover approach: when amfd comes
back from headless, it cannot know what happened during headless: SUSI
assignments may or may not have completed, some SUs may have failed over,
nodes may have been rebooted, and so on. Therefore, delayed_failover()
works like a garbage collector: it picks up inappropriate SUSI states and
sets them back to the right ones. The SG FSM can then start afterwards
in the STABLE state. As for the maintainability argument, it is more like
a "plug-in" on top of the current SG FSM code, and it is separate from the
SG FSM code. It currently works for the tests I suggested above, though
there may still be things left to improve.
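
To illustrate the "garbage collector" idea only, here is a toy sketch. It is not the code in the delayed_failover patch: the Susi record, the node_up flag and the two cleanup rules are assumptions made up for this example, and admin state (for example a locked SI) is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Susi:
    su: str
    si: str
    ha_state: str  # "ACTIVE", "STANDBY", "QUIESCED" or "QUIESCING"
    node_up: bool  # hypothetical flag: is the hosting node still in the cluster?


def collect_garbage(susis):
    """Clean up SUSI records in place and return the surviving ones.

    Both rules below are illustrative assumptions, not the real amfd logic.
    """
    # Rule 1 (assumed): drop assignments hosted on nodes lost during headless.
    alive = [s for s in susis if s.node_up]
    # Rule 2 (assumed): an SI left without an ACTIVE gets its first STANDBY promoted.
    for si in {s.si for s in alive}:
        group = [s for s in alive if s.si == si]
        if not any(s.ha_state == "ACTIVE" for s in group):
            standby = next((s for s in group if s.ha_state == "STANDBY"), None)
            if standby is not None:
                standby.ha_state = "ACTIVE"
    return alive


# Scenario like test case 1- above: PL4 (active SU) is stopped during headless.
before = [
    Susi("SU4", "AmfDemoTwon", "ACTIVE", node_up=False),
    Susi("SU5", "AmfDemoTwon", "STANDBY", node_up=True),
]
after = collect_garbage(before)
print([(s.su, s.ha_state) for s in after])
```

Once such a pass has run, every remaining record is in a settled state, which is what allows the SG FSM to start from STABLE.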

For now, I don't know which approach is better until the PoC patch is
completed. The idea of the PoC is also good: apps don't lose
availability.

Thanks,
Minh


On 08/03/16 22:41, praveen malviya wrote:
> Please apply this patch on top of 1 to 4.
>
> Thanks
> Praveen
>
> On 08-Mar-16 5:08 PM, praveen malviya wrote:
>> Hi,
>>
>> Attached is the PoC patch that re-uses existing SG FSM code, by resuming
>> the SG FSM state after first controller comes up.
>> Thereby, the patch avoids rebooting of the node in transition.
>>
>> More about the patch below:
>>
>> With this approach SG FSM was successfully recovered for QUIESCED and
>> QUIESCING state transition of SUSI (SUs for each SIs)in the following
>> admin operations cases (without SI deps):
>> 1)SU SHUTDOWN and LOCK.
>> 2)SI LOCK and SHUTDOWN.
>> 3)SG LOCK and SHUTDOWN.
>>
>> After resuming of SG FSM state, SG moved to correct state after first
>> controller comes up and completed the admin operation as it does now in
>> normal cluster. Also UNLOCK operation was successful in all the cases.
>>
>>      In the delayed_failover approach (06-08), the problem was HA state
>> of SU for each SI was not considered and each SUSI was assumed assigned.
>> Because of this, original state of SU and hence SG FSM could not be
>> resumed.
>>
>> Approach in this SG FSM recovery patch:
>>      It recovers each SUSI FSM state and using this it resumes SG in
>> same FSM state as it was before controllers went down.Thus it will use
>> the original SG FSM code.
>>
>> Some benefits of this approach:
>>      1) Existing code of SG FSM can be used.
>>      2) Does not require node reboot in transition state.
>>      3) SG FSM code for each model already handles faults, si deps and
>> all admin operation so always any issue will just require deducing the
>> SG FSM state at the time of controller down and resuming SG in the same
>> state.
>>      4)There are FIVE SG FSM states in our code out of which STABLE
>> state of SG is not applicable for transition state. So there are only
>> FOUR SG fsm states to be resumed.
>>
>> Note: For testing admin op, cluster was freshly started for each lock
>> and shutdown operation as assignment counter related changes is not done
>> in this patch.
>>
>> Thanks,
>> Praveen.
>>
>>
>>
>> On 04-Mar-16 9:11 AM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Please see my comments in line with [Minh]
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 04/03/16 00:41, praveen malviya wrote:
>>>> Hi Minh,
>>>>
>>>> The second version of the patches you had published handles immediate
>>>> escalation only (1 to 4), but it does not perform 'immediate escalation'
>>>> during the transient phases.
>>> [Minh] The patch version is important, to be sure we have the same
>>> view. The latest version is V4 (not V2), which has immediate escalation
>>> in amfnd. By performing "immediate escalation during transient phases"
>>> I take it you mean "reboot the node that has a transient SUSI", and that
>>> was suggested after V4 was published. As far as I am concerned, we agree
>>> to push "immediate escalation" (amfnd) to the base patches (#1 to #4) and
>>> separate "delayed failover" (amfd) into another patch. From there, we
>>> will review and see whether or not "delayed failover" is necessary.
>>>>
>>>> So, the concept patch is not for "delayed failover" approach but for
>>>> doing 'Immediate escalation' during transient states also.
>>>> The 'immediate escalation' approach becomes **complete** with the
>>>> concept patch. Ofcourse, as mentioned before i would update the
>>>> concept patch further.
>>>>
>>>> Regarding the scanning of SUSIs in SG, it is scanned just to know the
>>>> active and standby SU but not to handle the transition state at susi
>>>> level. After rebooting the node, existing node-failover functionality
>>>> of SG FSM will take care of things at SUSI level including si deps for
>>>> all red models. In fact, later on the patch can be evolved to call
>>>> existing SG FSM code.
>>> [Minh] As mentioned in my previous email, I understand the concept patch
>>> is in progress and its issues will be fixed eventually (I would rather
>>> say completed). But my question was about the *value* it gives in the end.
>>> Many healthy applications will lose availability because a node is
>>> rebooted due to a transient SUSI in another (unimportant) application, and
>>> the node reboot is unexpected per configuration.
>>> I understand that the complexity/maintainability of the AMF code is
>>> important for maintainers, but are there any other reasons that support
>>> "immediate escalation"? If that is the case, the concept patch seems to
>>> sacrifice availability to gain lower complexity and better
>>> maintainability of the code. But if we all agree that availability is
>>> most important, then complexity/maintainability is just a matter of coding?
>>>>
>>>> I think in version 1 of the patches, I had given comments about SI deps
>>>> and delayed fail-over getting mixed and about the way SI dependency has
>>>> been scanned. I never got responses to those comments or to my other
>>>> comments on the v1 amfd patch. Those are important comments and need
>>>> to be addressed.
>>> [Minh] We have received 2 emails with comments on V2 so far, and all of
>>> them have been answered. In V4 we have corrected the patches according
>>> to some of your comments.
>>>
>>> Below are the dates and times when the responses were sent:
>>>
>>> Date: Fri, 12 Feb 2016 11:13:03 +1100
>>> From: minh chau<minh.c...@dektech.com.au>
>>> Subject: Re: [devel] [PATCH 1 of 5] amfd: Add README file for cloud
>>>      resilience support [#1620] V2
>>> To: praveen malviya<praveen.malv...@oracle.com>,
>>> hans.nordeb...@ericsson.com,gary....@dektech.com.au,
>>> nagendr...@oracle.com
>>> Cc:opensaf-devel@lists.sourceforge.net
>>>
>>>
>>> Date: Fri, 26 Feb 2016 14:41:18 +1100
>>> From: minh chau<minh.c...@dektech.com.au>
>>> Subject: Re: [devel] [PATCH 2 of 5] amfd: Add support for cloud
>>>      resilience at director [#1620] V2
>>> To: praveen malviya<praveen.malv...@oracle.com>,
>>> hans.nordeb...@ericsson.com,gary....@dektech.com.au,
>>> nagendr...@oracle.com
>>> Cc:opensaf-devel@lists.sourceforge.net
>>>>
>>>> Regarding the approach taken in delayed_failover() functionality, I do
>>>> not know whether it has been explored or not, but it does not use
>>>> existing SG FSM code. Using the existing code will keep it simple too.
>>> [Minh] Which SG FSM code do you think should be used? If there is any
>>> inappropriate code, can we all go through it and optimize it?
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>> On 03-Mar-16 1:20 PM, minh chau wrote:
>>>>> Hi Nagu, Praveen,
>>>>>
>>>>> I have been trying your patch, with the test case below:
>>>>> Setup 2N model, PL4 host SU4 (act), PL5 host SU5(stb)
>>>>> 1. issue admin command shutdown SG
>>>>> 2. Hanging quiescing csi_set callback
>>>>> 3. Stop both SCs
>>>>> 4. Stop PL4
>>>>> 5. Restart both SCs
>>>>>
>>>>> I have seen this error after SCs come back also:
>>>>> SC-2 osafamfd[477]: ER avd_ckpt_siass:
>>>>> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon
>>>>> safSi=AmfDemoTwon,safApp=AmfDemoTwon does not exist
>>>>>
>>>>> From the trace file, after amfd sends the reboot message to PL5,
>>>>> realign() is called. realign() then creates a duplicate SUSI for SU5,
>>>>> and this duplicate SUSI is not checkpointed at SC-2.
>>>>> When PL5 reboots, node_fail() of the 2N SG calls
>>>>> AVD_SU::delete_all_susis() to delete all SUSIs of SU5. Both duplicate
>>>>> SUSIs are deleted and checkpointed, and the second one causes
>>>>> "ER avd_ckpt_siass: ... does not exist".
>>>>>
>>>>> This error should also happen with lock/shutdown of SG/SU/Node/NodeGroup,
>>>>> and the NodeGroup gets stuck in SHUTTING_DOWN.
>>>>> I think you will fix these kinds of issues eventually. However, looking
>>>>> through the concept patch, its complexity/maintainability is similar to
>>>>> patch #6: both have to scan through all SUs/SIs to determine the
>>>>> transient SUSIs. The difference is the decision made afterwards: one
>>>>> reboots the node, the other adjusts the state. Rebooting the node,
>>>>> though, seems to lose availability?
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>>
>>>>> On 03/03/16 11:32, minh chau wrote:
>>>>>> Hi Nagu, Praveen
>>>>>>
>>>>>> From patch 09 to patch 14, they are fixes for bugs that you also 
>>>>>> need
>>>>>> on top of patches #4.
>>>>>> The problems you reported should not happen if you have them. 
>>>>>> They are
>>>>>> regardless whether we *reboot node if transient states* or *adjust
>>>>>> transient states* (delayed failover).
>>>>>>
>>>>>> Patch 09 -> Return TRY_AGAIN for pg track start/stop in headless
>>>>>> Patch 10 -> Resend pg information to directors after headless
>>>>>> Patch 11 -> There are two fixes in this patch: (11_1) Fix mapping 
>>>>>> su,
>>>>>> and (11_2) fix amfnd coredump given that we allow comp/su failover
>>>>>> (patch #5). I split them
>>>>>> Patch 12 -> Do not disable healthy SU
>>>>>> Patch 13 -> It's for one payload limitation
>>>>>> Patch 14 -> It's for transient state at csi level, written on top of
>>>>>> patch #6.
>>>>>>
>>>>>> So you also need patch 09, 10, 11_1, 12, 13 on top of patch #4, and
>>>>>> they need to be reviewed and pushed together with #1->#4 as well.
>>>>>>
>>>>>> The patch #5 #6 #7 #8 are on different view from "immediate
>>>>>> escalation" and "reboot node if transient states".
>>>>>>
>>>>>> We will look at your assignment_recovery.patch.
>>>>>>
>>>>>> I also attach patch 1620_amfd_adjust_csi_V2.diff, which is to fix 
>>>>>> the
>>>>>> issue in TC #27, but it also depends on conclusion of how to deal 
>>>>>> with
>>>>>> transient states after headless.
>>>>>>
>>>>>> Thanks,
>>>>>> Minh
>>>>>>
>>>>>> On 03/03/16 02:12, Nagendra Kumar wrote:
>>>>>>>
>>>>>>> #1 I have applied patches #1 to #4 only. With this patches(not 
>>>>>>> having
>>>>>>> patch #6), I thought to have passed most of the following tests, 
>>>>>>> but
>>>>>>> they got failed(Listed below).
>>>>>>>
>>>>>>> I could not test other scenarios (including alarms and
>>>>>>> notifications), because I haven’t applied patch #6. I think there
>>>>>>> should be a simple patch replacing patch #6, which handles 
>>>>>>> transient
>>>>>>> state as ‘reboot the node‘ if Amf finds SUSI in transient state on
>>>>>>> that node.
>>>>>>>
>>>>>>> I am attaching a concept patch(assignment_recovery.patch), which 
>>>>>>> pass
>>>>>>> some of the scenarios and we are testing and enhancing it.
>>>>>>>
>>>>>>> As Praveen has suggested, we need to reboot the node that is in a
>>>>>>> transient state, to keep it simple.
>>>>>>>
>>>>>>> This patch reduces complexity and improves maintainability.
>>>>>>>
>>>>>>> So, ACK for patch #1-#4 along with the attached patch.
>>>>>>>
>>>>>>> Please note that the attached patch has been created on patch #6 of
>>>>>>> yours, so please apply #1 to #4 and then #6 and then the attached
>>>>>>> patch.
>>>>>>>
>>>>>>> Currently the patch is for 2N red model. We are working to make for
>>>>>>> Nway Act and No red model (and possibly for Nway and NpM), we will
>>>>>>> publish it tomorrow.
>>>>>>>
>>>>>>> TC #1:
>>>>>>>
>>>>>>> Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover
>>>>>>> as false) and logs attached(TC 1) in the ticket.
>>>>>>>
>>>>>>> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on
>>>>>>> SC-2.
>>>>>>>
>>>>>>> 2. Stop SC-1 and kill demo. It goes for comp failover as 
>>>>>>> configured.
>>>>>>> Ideally, node should reboot.
>>>>>>>
>>>>>>> 3. Start SC-1. After cluster timer expires, PL-4 got the following
>>>>>>> error messages:
>>>>>>>
>>>>>>> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition
>>>>>>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 :
>>>>>>> SI=safSi=AmfDemo,safApp=AmfDemo1
>>>>>>>
>>>>>>> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition
>>>>>>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 :
>>>>>>> SI=safSi=AmfDemo1,safApp=AmfDemo1
>>>>>>>
>>>>>>> There is no assignment given for SU1. SU2 has Standby assignments:
>>>>>>>
>>>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>>>>>>         saAmfSISUHAState=STANDBY(2)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>>>>>>         saAmfSISUHAState=STANDBY(2)
>>>>>>>
>>>>>>> Other problems: a.) Further command for locking SU1/SU2 fails in SG
>>>>>>> unstable error.
>>>>>>>
>>>>>>>                                 b.) Immlist if SU2 gives the below
>>>>>>> result, Standby assignment it prints as 4, which is wrong:
>>>>>>>
>>>>>>> saAmfSUNumCurrStandbySIs SA_UINT32_T  4 (0x4)
>>>>>>>
>>>>>>> saAmfSUNumCurrActiveSIs SA_UINT32_T  0 (0x0)
>>>>>>>
>>>>>>>                                 c.) Even if SC-2 joins, and you do
>>>>>>> failover/switchover of SC-1, still same as above.
>>>>>>>
>>>>>>> TC #2: After execution of TC #1, stop PL-3. In worst case, SU2
>>>>>>> assignment should change to Act, which is not happening. After
>>>>>>> stopping of PL-4 also, the same problems as TC #1. logs attached(TC
>>>>>>> 2).
>>>>>>>
>>>>>>> TC #3: After TC #2, start PL-3 and start SC-2.
>>>>>>>
>>>>>>>                 SU1 is instantiated, but no assignment and the same
>>>>>>> problem as above.
>>>>>>>
>>>>>>>                 When stop PL-4, SU1 gets assignments, the following
>>>>>>> logs comes at SC-2:
>>>>>>>
>>>>>>> Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass:
>>>>>>> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 
>>>>>>> safSi=AmfDemo,safApp=AmfDemo1
>>>>>>> does not exist
>>>>>>>
>>>>>>> Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass:
>>>>>>> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1
>>>>>>> safSi=AmfDemo1,safApp=AmfDemo1 does not exist
>>>>>>>
>>>>>>> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link
>>>>>>> <1.1.2:eth0-1.1.4:eth0>, peer not responding
>>>>>>>
>>>>>>> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link
>>>>>>> <1.1.2:eth0-1.1.4:eth0> on network plane A
>>>>>>>
>>>>>>> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact
>>>>>>> with <1.1.4>
>>>>>>>
>>>>>>> Start PL-4, SU2 gets Standby assignments and everything works fine
>>>>>>> after that.
>>>>>>>
>>>>>>> TC #4: Similar problems exist in the following test cases:
>>>>>>>
>>>>>>> a.)Configuration same as TC #1 except saAmfSutDefSUFailover as 
>>>>>>> true.
>>>>>>>
>>>>>>>                 After killing demo, PL-3 went for reboot.
>>>>>>>
>>>>>>>                 But the problem is the same as shown in TC #1, 
>>>>>>> TC #2
>>>>>>> and TC #3.
>>>>>>>
>>>>>>> b.) Configuration same as TC #1 except with
>>>>>>>  saAmfCtDefRecoveryOnError as 2 and saAmfCtDefDisableRestart as 1.
>>>>>>>
>>>>>>>                 But the problem is the same as shown in TC #1, 
>>>>>>> TC #2
>>>>>>> and TC #3.
>>>>>>>
>>>>>>> c.)Configuration same as TC #1 except with 
>>>>>>> saAmfCtDefRecoveryOnError
>>>>>>> as 2 and saAmfCtDefDisableRestart as 1 and saAmfSutDefSUFailover
>>>>>>> as 1.
>>>>>>>
>>>>>>>                 After killing demo, PL-3 went for reboot.
>>>>>>>
>>>>>>>                 But the problem is the same as shown in TC #1, 
>>>>>>> TC #2
>>>>>>> and TC #3.
>>>>>>>
>>>>>>> TC #5:  Configuration same as TC #1 except with
>>>>>>>  saAmfCtDefRecoveryOnError as 2. Configuration and logs(TC 5)
>>>>>>> attached in ticket.
>>>>>>>
>>>>>>> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on
>>>>>>> SC-2.
>>>>>>>
>>>>>>> 2. Stop SC-1 and kill demo. It goes for comp restart as configured.
>>>>>>>
>>>>>>> 3. Start SC-1. After SC-1 comes up and before cluster timer 
>>>>>>> expires,
>>>>>>> stop PL-3:
>>>>>>>
>>>>>>> Even if PL-3 is stopped(see below PL-3 is not available), SU1 is
>>>>>>> still having Act assignment and SU2 is having Standby assignment:
>>>>>>>
>>>>>>> PM_SC-1:/home/nagu/views/staging # amf-state siass
>>>>>>>
>>>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>>>>>>         saAmfSISUHAState=STANDBY(2)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>>>>>>         saAmfSISUHAState=STANDBY(2)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>>>>>>         saAmfSISUHAState=ACTIVE(1)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>>>>>>         saAmfSISUHAState=ACTIVE(1)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>>>>>>         saAmfSISUHAState=ACTIVE(1)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>>>>         saAmfSISUHAState=ACTIVE(1)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>> safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>>>>>>         saAmfSISUHAState=ACTIVE(1)
>>>>>>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>>>>>>
>>>>>>> TC #6:  After TC #5, start PL-3:
>>>>>>>
>>>>>>> SU1 is not given any assignment (may be because it exists in Amfd
>>>>>>> db):
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 osafamfwd[8318]: Started
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO
>>>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' Presence State 
>>>>>>> INSTANTIATING
>>>>>>> => INSTANTIATED
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigning
>>>>>>> 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to
>>>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigned
>>>>>>> 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to
>>>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO
>>>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State
>>>>>>> UNINSTANTIATED => INSTANTIATING
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 opensafd: OpenSAF(5.0.M0 -
>>>>>>> 7282:4fbffe857512:) services successfully started
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 amf_demo[8337]:
>>>>>>> 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO
>>>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State
>>>>>>> INSTANTIATING => INSTANTIATED
>>>>>>>
>>>>>>> Mar  2 14:22:06 PM_PL-3 amf_demo[8337]: HC started with AMF
>>>>>>>
>>>>>>> TC #7:  After TC #6:
>>>>>>>
>>>>>>> Lock SU1: Amfnd of PL-3 throws error:
>>>>>>>
>>>>>>> Mar  2 14:23:57 PM_PL-3 osafamfnd[8259]: ER susi_assign_evh:
>>>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
>>>>>>>
>>>>>>> This is obvious because, Amfnd doesn’t have any assignment.
>>>>>>>
>>>>>>> SU1 admin state is locked, but SUSI is being shown on SU1.
>>>>>>>
>>>>>>> TC #8:  After TC #7:
>>>>>>>
>>>>>>> Lock SU1, it throws error:
>>>>>>>
>>>>>>> Admin operation is already going on
>>>>>>> (su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
>>>>>>>
>>>>>>> TC #9:  Same as TC #6 except Configure saAmfCtDefRecoveryOnError as
>>>>>>> Node Switchover/Failover/Failfast.
>>>>>>>
>>>>>>> The problem reported in TC #4 exists.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -Nagu
>>>>>>>
>>>>>>> > -----Original Message-----
>>>>>>>
>>>>>>> > From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>>>>>>>
>>>>>>> > Sent: 25 February 2016 14:14
>>>>>>>
>>>>>>> > To: hans.nordeb...@ericsson.com; gary....@dektech.com.au; 
>>>>>>> Nagendra
>>>>>>>
>>>>>>> > Kumar; Praveen Malviya; minh.c...@dektech.com.au
>>>>>>>
>>>>>>> > Cc: opensaf-devel@lists.sourceforge.net
>>>>>>>
>>>>>>> > Subject: [PATCH 01 of 15] amfd: Add support for cloud 
>>>>>>> resilience at
>>>>>>> common
>>>>>>>
>>>>>>> > libs [#1620]
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
