Hi Minh,

I am going to execute the test cases with all the patches applied. I will share the test results once I am done.
Can you please send all the patches (the latest versions, if any) in a tar file? Please note that a few of the test cases executed on the previous patch sets would be invalid because of the 'immediate escalation' implemented in the latest patches (#1-#4).

Thanks
-Nagu

From: minh chau [mailto:minh.c...@dektech.com.au]
Sent: 15 March 2016 04:51
To: Nagendra Kumar; hans.nordeb...@ericsson.com; gary....@dektech.com.au; Praveen Malviya
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

Hi Nagu, Praveen,

Since #1-#4 have been acked, can you please push them? #5 and #11_2 allow comp/SU failover during headless, so we may have to revisit them later. However, the patches #9, #10, #11_1, #12 and #13 are bug fixes that do not relate to *delayed failover* and are needed for #1-#4. Can you please have a look?

Thanks,
Minh

On 03/03/16 02:12, Nagendra Kumar wrote:

#1: I have applied patches #1 to #4 only. With these patches (i.e. without patch #6), I expected most of the following tests to pass, but they failed (listed below). I could not test the other scenarios (including alarms and notifications) because I haven't applied patch #6.

I think there should be a simple patch replacing patch #6, which treats a transient state as 'reboot the node': if AMF finds a SUSI in a transient state on a node, that node is rebooted. I am attaching a concept patch (assignment_recovery.patch) which passes some of the scenarios; we are testing and enhancing it. As Praveen has suggested, we need to reboot the node that is stuck in a transient state to keep the handling simple. This approach reduces complexity and improves maintainability. So, ACK for patches #1-#4 along with the attached patch. Please note that the attached patch has been created on top of your patch #6, so please apply #1 to #4, then #6, and then the attached patch. Currently the patch covers the 2N redundancy model. We are working to extend it to the N-way Active and No-redundancy models (and possibly N-way and N+M); we will publish that tomorrow.

TC #1: Configuration (comp recovery is comp failover, saAmfSutDefSUFailover set to false); configuration and logs attached (TC 1) in the ticket.
1. Start SC-1, PL-3 and PL-4. SU1 is Active on PL-3 and SU2 is Standby on PL-4.
2. Stop SC-1 and kill demo. It goes for comp failover as configured. Ideally, the node should reboot.
3. Start SC-1. After the cluster timer expires, PL-4 gets the following error messages:

Mar 2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : SI=safSi=AmfDemo,safApp=AmfDemo1
Mar 2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : SI=safSi=AmfDemo1,safApp=AmfDemo1

There is no assignment given to SU1. SU2 has Standby assignments:

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
        saAmfSISUHAState=STANDBY(2)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
        saAmfSISUHAState=STANDBY(2)

Other problems:
a.) A further command to lock SU1/SU2 fails with an 'SG unstable' error.
b.) immlist of SU2 gives the result below; the Standby assignment count is printed as 4, which is wrong:

        saAmfSUNumCurrStandbySIs    SA_UINT32_T  4 (0x4)
        saAmfSUNumCurrActiveSIs     SA_UINT32_T  0 (0x0)

c.) Even if SC-2 joins and you do a failover/switchover of SC-1, the result is still the same as above.

TC #2: After executing TC #1, stop PL-3. In the worst case, SU2's assignment should change to Active, which is not happening. After stopping PL-4 as well, the same problems as in TC #1 occur. Logs attached (TC 2).
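For reference, the assignment state and counters shown above were collected with commands along these lines (the SU2 DN is taken from the output above; exact tool options may vary between OpenSAF versions):

    # dump the current SU-SI assignments cluster-wide
    amf-state siass

    # list the SU's runtime counters, including saAmfSUNumCurrActiveSIs and saAmfSUNumCurrStandbySIs
    immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1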
TC #3: After TC #2, start PL-3 and start SC-2. SU1 is instantiated, but it gets no assignment and the same problems as above occur. When PL-4 is stopped, SU1 gets assignments and the following logs appear on SC-2:

Mar 2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1 does not exist
Mar 2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo1,safApp=AmfDemo1 does not exist
Mar 2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link <1.1.2:eth0-1.1.4:eth0>, peer not responding
Mar 2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link <1.1.2:eth0-1.1.4:eth0> on network plane A
Mar 2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact with <1.1.4>

Start PL-4: SU2 gets Standby assignments and everything works fine after that.

TC #4: Similar problems exist in the following test cases (an immcfg sketch for these attribute variations follows this list):
a.) Configuration same as TC #1 except saAmfSutDefSUFailover set to true. After killing demo, PL-3 went for a reboot, but the problems are the same as shown in TC #1, TC #2 and TC #3.
b.) Configuration same as TC #1 except saAmfCtDefRecoveryOnError set to 2 and saAmfCtDefDisableRestart set to 1. The problems are the same as shown in TC #1, TC #2 and TC #3.
c.) Configuration same as TC #1 except saAmfCtDefRecoveryOnError set to 2, saAmfCtDefDisableRestart set to 1 and saAmfSutDefSUFailover set to 1. After killing demo, PL-3 went for a reboot, but the problems are the same as shown in TC #1, TC #2 and TC #3.
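The attribute variations in TC #4 (and in TC #5 below) could be applied with immcfg roughly as sketched here. The comp-type and SU-type DNs are placeholders for the AmfDemo model rather than values taken from the attached configuration; these type attributes are normally set in the IMM XML loaded before the cluster starts, and AMF may reject runtime changes depending on the state of the model:

    # component type defaults: recovery on error (2 = component restart) and disable-restart
    immcfg -a saAmfCtDefRecoveryOnError=2 -a saAmfCtDefDisableRestart=1 safVersion=1,safCompType=AmfDemo1

    # SU type default: escalate a component failover to SU failover (1 = true)
    immcfg -a saAmfSutDefSUFailover=1 safVersion=1,safSuType=AmfDemo1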
TC #5: Configuration same as TC #1 except saAmfCtDefRecoveryOnError set to 2. Configuration and logs (TC 5) attached in the ticket.
1. Start SC-1, PL-3 and PL-4. SU1 is Active on PL-3 and SU2 is Standby on PL-4.
2. Stop SC-1 and kill demo. It goes for comp restart as configured.
3. Start SC-1.
4. After SC-1 comes up and before the cluster timer expires, stop PL-3. Even though PL-3 is stopped (note below that PL-3 is not available), SU1 still has its Active assignments and SU2 its Standby assignments:

PM_SC-1:/home/nagu/views/staging # amf-state siass
safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
        saAmfSISUHAState=STANDBY(2)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
        saAmfSISUHAState=STANDBY(2)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
        saAmfSISUHAState=ACTIVE(1)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
        saAmfSISUHAState=ACTIVE(1)
        saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

TC #6: After TC #5, start PL-3. SU1 is not given any assignment (maybe because it already exists in the Amfd DB):

Mar 2 14:22:06 PM_PL-3 osafamfwd[8318]: Started
Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING => INSTANTIATED
Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigning 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigned 'safSi=NoRed2,safApp=OpenSAF' ACTIVE to 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State UNINSTANTIATED => INSTANTIATING
Mar 2 14:22:06 PM_PL-3 opensafd: OpenSAF(5.0.M0 - 7282:4fbffe857512:) services successfully started
Mar 2 14:22:06 PM_PL-3 amf_demo[8337]: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
Mar 2 14:22:06 PM_PL-3 osafamfnd[8259]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATING => INSTANTIATED
Mar 2 14:22:06 PM_PL-3 amf_demo[8337]: HC started with AMF

TC #7: After TC #6, lock SU1 (the lock/unlock commands are sketched after TC #9). Amfnd on PL-3 throws an error:

Mar 2 14:23:57 PM_PL-3 osafamfnd[8259]: ER susi_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments

This is expected, because Amfnd does not have any assignment. SU1's admin state becomes locked, but a SUSI is still being shown on SU1.

TC #8: After TC #7, lock SU1 again; it throws the error: Admin operation is already going on (su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1

TC #9: Same as TC #6 except that saAmfCtDefRecoveryOnError is configured as Node Switchover/Failover/Failfast. The problem reported in TC #4 exists.
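For completeness, the lock operation used in TC #7 and TC #8 is the standard AMF admin operation shown below (the SU DN is the one from the logs above). Also, if I recall the SaAmfRecommendedRecoveryT values correctly, the Node Switchover/Failover/Failfast settings in TC #9 correspond to saAmfCtDefRecoveryOnError values 4, 5 and 6:

    # lock the application SU (TC #7 / TC #8); unlock reverses it
    amf-adm lock safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1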
Thanks
-Nagu

> -----Original Message-----
> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
> Sent: 25 February 2016 14:14
> To: hans.nordeb...@ericsson.com; gary....@dektech.com.au; Nagendra Kumar; Praveen Malviya; minh.c...@dektech.com.au
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]