Re: [devel] [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

minh chau Wed, 02 Mar 2016 23:51:07 -0800

Hi Nagu, Praveen,

I have been trying your patch, with the test case below:
Setup: 2N model, PL4 hosts SU4 (active), PL5 hosts SU5 (standby)
1. Issue admin command shutdown SG
2. Hang the quiescing csi_set callback
3. Stop both SCs
4. Stop PL4
5. Restart both SCs


I also see this error after the SCs come back:
SC-2 osafamfd[477]: ER avd_ckpt_siass: 
safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon 
safSi=AmfDemoTwon,safApp=AmfDemoTwon does not exist

From the trace file, after amfd sends the reboot message to PL5, 
realign() is called. realign() then creates a duplicate SUSI for SU5, 
and this duplicate SUSI is not checkpointed to SC-2.
When PL5 reboots, node_fail() of the 2N SG calls 
AVD_SU::delete_all_susis() to delete all SUSIs of SU5. Both duplicate 
SUSIs are deleted and each deletion is checkpointed, so the second one 
causes "ER avd_ckpt_siass: ... does not exist".
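
To illustrate the sequence above, here is a minimal Python sketch. It is 
a toy model, not OpenSAF code: the Active and Standby classes and the 
add_susi/ckpt_add/ckpt_delete names are hypothetical, and only 
delete_all_susis borrows its name from AVD_SU::delete_all_susis(). It 
assumes SC-2 is the standby that receives the checkpoint updates.

```python
# Toy model only: class and method names are illustrative assumptions,
# not the real amfd data structures or checkpoint API.
class Standby:
    def __init__(self):
        self.susis = set()   # SUSIs known through checkpointing
        self.errors = []     # stands in for the syslog ER lines

    def ckpt_add(self, su, si):
        self.susis.add((su, si))

    def ckpt_delete(self, su, si):
        if (su, si) not in self.susis:
            self.errors.append(f"{su} {si} does not exist")
            return
        self.susis.remove((su, si))


class Active:
    def __init__(self, standby):
        self.standby = standby
        self.susis = []      # a list, so a duplicate entry is possible

    def add_susi(self, su, si, checkpoint=True):
        self.susis.append((su, si))
        if checkpoint:
            self.standby.ckpt_add(su, si)

    def delete_all_susis(self, su):
        # Every local entry is deleted and every deletion is checkpointed.
        for entry in [e for e in self.susis if e[0] == su]:
            self.susis.remove(entry)
            self.standby.ckpt_delete(*entry)


standby = Standby()
active = Active(standby)
active.add_susi("SU5", "SI")                    # original, checkpointed
active.add_susi("SU5", "SI", checkpoint=False)  # duplicate, not checkpointed
active.delete_all_susis("SU5")                  # node failure handling
print(standby.errors)
```

In this model the first delete removes the only SUSI the standby knows 
about, so the second delete is the one that reports "does not exist".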

This error should also happen with lock/shutdown of SG/SU/Node/NodeGroup, 
and the NodeGroup gets stuck in SHUTTING_DOWN.
I think you will eventually fix these kinds of issues, but looking 
through the concept patch, its complexity/maintainability is similar to 
patch #6. Both have to scan through all SUs/SIs to determine transient 
SUSIs. The difference is the decision that is made: one reboots the 
node, the other adjusts the state. Rebooting the node, though, seems to 
lose availability?

Thanks,
Minh


On 03/03/16 11:32, minh chau wrote:
> Hi Nagu, Praveen
>
> Patches 09 to 14 are fixes for bugs that you also need on top of 
> patch #4.
> The problems you reported should not happen if you have them. They are 
> needed regardless of whether we *reboot node if transient states* or 
> *adjust transient states* (delayed failover).
>
> Patch 09 -> Return TRY_AGAIN for pg track start/stop in headless
> Patch 10 -> Resend pg information to directors after headless
> Patch 11 -> There are two fixes in this patch: (11_1) Fix mapping su, 
> and (11_2) fix amfnd coredump given that we allow comp/su failover 
> (patch #5). I split them
> Patch 12 -> Do not disable healthy SU
> Patch 13 -> It's for one payload limitation
> Patch 14 -> It's for transient state at csi level, written on top of 
> patch #6.
>
> So you also need patch 09, 10, 11_1, 12, 13 on top of patch #4, and 
> they need to be reviewed and pushed together with #1->#4 as well.
>
> Patches #5, #6, #7 and #8 take a different view from "immediate 
> escalation" and "reboot node if transient states".
>
> We will look at your assignment_recovery.patch.
>
> I also attach patch 1620_amfd_adjust_csi_V2.diff, which fixes the 
> issue in TC #27, but it also depends on the conclusion of how to deal 
> with transient states after headless.
>
> Thanks,
> Minh
>
> On 03/03/16 02:12, Nagendra Kumar wrote:
>>
>> #1 I have applied patches #1 to #4 only. With these patches (without 
>> patch #6), I expected most of the following tests to pass, but they 
>> failed (listed below).
>>
>> I could not test other scenarios (including alarms and 
>> notifications), because I haven’t applied patch #6. I think there 
>> should be a simple patch replacing patch #6, which handles transient 
>> state as ‘reboot the node‘ if Amf finds SUSI in transient state on 
>> that node.
>>
>> I am attaching a concept patch (assignment_recovery.patch), which 
>> passes some of the scenarios; we are testing and enhancing it.
>>
>> As Praveen has suggested, we need to reboot the node that is in a 
>> transient state to keep it simple.
>>
>> This patch reduces complexity and improves maintainability.
>>
>> So, ACK for patch #1-#4 along with the attached patch.
>>
>> Please note that the attached patch has been created on patch #6 of 
>> yours, so please apply #1 to #4 and then #6 and then the attached patch.
>>
>> Currently the patch is for the 2N red model. We are working to extend 
>> it to the Nway Act and No red models (and possibly Nway and NpM); we 
>> will publish it tomorrow.
>>
>> TC #1:
>>
>> Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover 
>> as false) and logs attached(TC 1) in the ticket.
>>
>> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on SC-2.
>>
>> 2. Stop SC-1 and kill demo. It goes for comp failover as configured. 
>> Ideally, node should reboot.
>>
>> 3. Start SC-1. After cluster timer expires, PL-4 got the following 
>> error messages:
>>
>> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
>> SI=safSi=AmfDemo,safApp=AmfDemo1
>>
>> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
>> SI=safSi=AmfDemo1,safApp=AmfDemo1
>>
>> There is no assignment given for SU1. SU2 has Standby assignments:
>>
>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>
>> saAmfSISUHAState=STANDBY(2)
>>
>> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>
>> saAmfSISUHAState=STANDBY(2)
>>
>> Other problems:
>>
>> a.) Further commands for locking SU1/SU2 fail with an SG unstable 
>> error.
>>
>> b.) immlist of SU2 gives the result below; it prints the Standby 
>> assignment count as 4, which is wrong:
>>
>> saAmfSUNumCurrStandbySIs SA_UINT32_T  4 (0x4)
>>
>> saAmfSUNumCurrActiveSIs SA_UINT32_T  0 (0x0)
>>
>> c.) Even if SC-2 joins and you do a failover/switchover of SC-1, the 
>> result is still the same as above.
>>
>> TC #2: After execution of TC #1, stop PL-3. In the worst case, the 
>> SU2 assignment should change to Act, which is not happening. After 
>> stopping PL-4 as well, the same problems as in TC #1 occur. logs 
>> attached(TC 2).
>>
>> TC #3: After TC #2, start PL-3 and start SC-2.
>>
>> SU1 is instantiated, but gets no assignment, and the same problem as 
>> above occurs.
>>
>> When PL-4 is stopped, SU1 gets assignments, and the following logs 
>> appear at SC-2:
>>
>> Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: 
>> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1 
>> does not exist
>>
>> Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: 
>> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 
>> safSi=AmfDemo1,safApp=AmfDemo1 does not exist
>>
>> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link 
>> <1.1.2:eth0-1.1.4:eth0>, peer not responding
>>
>> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link 
>> <1.1.2:eth0-1.1.4:eth0> on network plane A
>>
>> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact 
>> with <1.1.4>
>>
>> Start PL-4, SU2 gets Standby assignments and everything works fine 
>> after that.
>>
>> TC #4: Similar problems exist in the following test cases:
>>
>> a.)Configuration same as TC #1 except saAmfSutDefSUFailover as true.
>>
>>                 After killing demo, PL-3 went for reboot.
>>
>>                 But the problem is the same as shown in TC #1, TC #2 
>> and TC #3.
>>
>> b.) Configuration same as TC #1 except with 
>>  saAmfCtDefRecoveryOnError as 2 and saAmfCtDefDisableRestart as 1.
>>
>>                 But the problem is the same as shown in TC #1, TC #2 
>> and TC #3.
>>
>> c.)Configuration same as TC #1 except with  saAmfCtDefRecoveryOnError 
>> as 2 and saAmfCtDefDisableRestart as 1 and saAmfSutDefSUFailover as 1.
>>
>>                 After killing demo, PL-3 went for reboot.
>>
>>                 But the problem is the same as shown in TC #1, TC #2 
>> and TC #3.
>>
>> TC #5:  Configuration same as TC #1 except with 
>>  saAmfCtDefRecoveryOnError as 2. Configuration and logs(TC 5) 
>> attached in ticket.
>>
>> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on SC-2.
>>
>> 2. Stop SC-1 and kill demo. It goes for comp restart as configured.
>>
>> 3. Start SC-1. After SC-1 comes up and before cluster timer expires, 
>> stop PL-3:
>>
>> Even though PL-3 is stopped (see below, PL-3 is not available), SU1 
>> still has the Act assignment and SU2 has the Standby assignment:
>>
>> PM_SC-1:/home/nagu/views/staging # amf-state siass
>>
>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>         saAmfSISUHAState=STANDBY(2)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>         saAmfSISUHAState=STANDBY(2)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>         saAmfSISUHAState=ACTIVE(1)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>         saAmfSISUHAState=ACTIVE(1)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>         saAmfSISUHAState=ACTIVE(1)
>>         saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>>
>> TC #6:  After TC #5, start PL-3:
>>
>> SU1 is not given any assignment (maybe because it exists in Amfd db):
>>
>> Mar  2 14:22:06 PM_PL-3 osafamfwd[8318]: Started
>>
>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO 
>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING 
>> => INSTANTIATED
>>
>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigning 
>> 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to 
>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
>>
>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigned 
>> 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to 
>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
>>
>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO 
>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State 
>> UNINSTANTIATED => INSTANTIATING
>>
>> Mar  2 14:22:06 PM_PL-3 opensafd: OpenSAF(5.0.M0 - 
>> 7282:4fbffe857512:) services successfully started
>>
>> Mar  2 14:22:06 PM_PL-3 amf_demo[8337]: 
>> 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
>>
>> Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO 
>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State 
>> INSTANTIATING => INSTANTIATED
>>
>> Mar  2 14:22:06 PM_PL-3 amf_demo[8337]: HC started with AMF
>>
>> TC #7:  After TC #6:
>>
>> Lock SU1: Amfnd of PL-3 throws error:
>>
>> Mar  2 14:23:57 PM_PL-3 osafamfnd[8259]: ER susi_assign_evh: 
>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
>>
>> This is expected, because Amfnd doesn’t have any assignment.
>>
>> SU1 admin state is locked, but SUSI is being shown on SU1.
>>
>> TC #8:  After TC #7:
>>
>> Lock SU1, it throws error:
>>
>> Admin operation is already going on 
>> (su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
>>
>> TC #9:  Same as TC #6 except Configure saAmfCtDefRecoveryOnError as 
>> Node Switchover/Failover/Failfast.
>>
>> The problem reported in TC #4 exists.
>>
>> Thanks
>>
>> -Nagu
>>
>> > -----Original Message-----
>>
>> > From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>>
>> > Sent: 25 February 2016 14:14
>>
>> > To: hans.nordeb...@ericsson.com; gary....@dektech.com.au; Nagendra
>>
>> > Kumar; Praveen Malviya; minh.c...@dektech.com.au
>>
>> > Cc: opensaf-devel@lists.sourceforge.net
>>
>> > Subject: [PATCH 01 of 15] amfd: Add support for cloud resilience at 
>> common
>>
>> > libs [#1620]
>>
>
