Hi,

Attached is the PoC patch that reuses the existing SG FSM code by resuming the SG FSM state after the first controller comes up.
The patch thereby avoids rebooting a node that is in a transition state.

More about the patch below:

With this approach, the SG FSM was successfully recovered for the QUIESCED and QUIESCING state transitions of SUSIs (the SU assignments for each SI) in the following admin operation cases (without SI deps):
1) SU SHUTDOWN and LOCK.
2) SI LOCK and SHUTDOWN.
3) SG LOCK and SHUTDOWN.

Once the first controller came up and the SG FSM state was resumed, the SG moved to the correct state and completed the admin operation as it does today in a normal cluster. The UNLOCK operation was also successful in all these cases.

In the delayed_failover approach (06-08), the problem was that the HA state of the SU for each SI was not considered and each SUSI was assumed to be assigned. Because of this, the original state of the SU, and hence the SG FSM state, could not be resumed.

Approach in this SG FSM recovery patch:
It recovers the FSM state of each SUSI and uses it to resume the SG in the same FSM state it was in before the controllers went down. It can therefore use the original SG FSM code.
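
To illustrate the per-SUSI recovery step: in the PoC, each assignment arrives as a temporary two-digit HA code (SA_AMF_HA_11 to SA_AMF_HA_44), and avd_susi_recreate() turns it back into an HA state plus a SUSI FSM state with an if/else chain. The sketch below is only a simplified rewrite of that decoding as arithmetic. The enum and function names are stand-ins and not the real amfd types, and the digit mapping is an assumption read off the if/else chain in the attached patch, not taken from amfnd/di.cc.

```cpp
// Stand-ins for the AMF types used by the patch (illustrative only).
enum class HaState { Active = 1, Standby = 2, Quiesced = 3, Quiescing = 4 };
enum class SusiFsm { Absent, Asgnd, Modify, Unasgn };

// Split a temporary code such as 13 into (HA state, SUSI FSM state).
// Units digit: HA state (1..4).
// Tens digit: assignment status, mapped as in the patch's if/else chain:
//   1 -> MODIFY, 2 and 4 -> UNASGN, 3 -> ASGND.
// Returns false for anything else, including the plain HA states 1..4.
inline bool decode_temp_ha(int code, HaState* ha, SusiFsm* fsm) {
  const int tens = code / 10;
  const int units = code % 10;
  if (units < 1 || units > 4) return false;

  switch (tens) {
    case 1: *fsm = SusiFsm::Modify; break;
    case 2:
    case 4: *fsm = SusiFsm::Unasgn; break;
    case 3: *fsm = SusiFsm::Asgnd; break;
    default: return false;
  }
  *ha = static_cast<HaState>(units);
  return true;
}
```

For example, decode_temp_ha(13, &ha, &fsm) gives QUIESCED with a pending MODIFY, which is the case where the SG should wait for the amfnd response instead of acting on its own.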

Some benefits of this approach:
1) The existing SG FSM code can be used.
2) It does not require a node reboot in a transition state.
3) The SG FSM code for each model already handles faults, SI deps and all admin operations, so any issue will only require deducing the SG FSM state at the time the controllers went down and resuming the SG in that state.
4) There are FIVE SG FSM states in our code, of which the STABLE state is not applicable to a transition state. So there are only FOUR SG FSM states to be resumed.
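
As a rough illustration of points 3) and 4), the choice among the four non-STABLE states can be written as a small function. This is only a sketch: the type and function names are hypothetical stand-ins, it models just the branches the PoC already implements for the 2N model (no SI deps, SU in QUIESCED/QUIESCING), and every TODO branch of the patch is represented here by returning STABLE, meaning "nothing resumed".

```cpp
// The five SG FSM states; only the four non-stable ones are ever resumed.
enum class SgFsm { Stable, SgRealign, SuOper, SiOper, SgAdmin };

// Pending SUSI FSM state of the SU that is in transition.
enum class SusiFsm { Asgnd, Modify, Unasgn };

// Entity the interrupted admin operation (lock/shutdown) was issued on.
enum class AdminOp { None, Su, Si, Sg };

// Pick the SG FSM state to resume in once the first controller is back.
inline SgFsm resume_state(AdminOp op, SusiFsm pending) {
  if (pending == SusiFsm::Modify) {
    // Quiesced/quiescing response still pending from amfnd: resume in the
    // state that matches the admin operation and wait for the response.
    switch (op) {
      case AdminOp::Su: return SgFsm::SuOper;
      case AdminOp::Si: return SgFsm::SiOper;
      case AdminOp::Sg: return SgFsm::SgAdmin;
      case AdminOp::None: break;
    }
  } else if (pending == SusiFsm::Unasgn &&
             (op == AdminOp::Su || op == AdminOp::Si)) {
    // Removal of assignments still pending: resume in SG realign.
    return SgFsm::SgRealign;
  }
  // Not handled by the PoC yet (TODO branches) or no operation in progress.
  return SgFsm::Stable;
}
```

Once the state is set this way, the next assignment response from amfnd drives the existing SG FSM code, which is the point of the approach.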

Note: For testing the admin operations, the cluster was freshly started for each lock and shutdown operation, as the assignment-counter-related changes are not done in this patch.

Thanks,
Praveen.



On 04-Mar-16 9:11 AM, minh chau wrote:
Hi Praveen,

Please see my comments in line with [Minh]

Thanks,
Minh

On 04/03/16 00:41, praveen malviya wrote:
Hi Minh,

The second version of the patches you had published handles immediate
escalation only (1 to 4), but it does not perform 'immediate escalation'
during the transient phases.
[Minh] The patch version is important so that we are sure we have the same
view. The latest version is V4 (not V2), which has immediate escalation in
amfnd. As I understand it, by performing "immediate escalation during
transient phases" you mean "reboot the node that has a transient SUSI", and
this was suggested after V4 was published. As far as I am concerned, we agree
to push "immediate escalation" (amfnd) to the base patches (#1 to #4) and
separate "delayed failover" (amfd) into another patch. From there, we will
review and see whether or not "delayed failover" is necessary.

So, the concept patch is not for the "delayed failover" approach but for
doing 'Immediate escalation' during transient states as well.
The 'immediate escalation' approach becomes **complete** with the
concept patch. Of course, as mentioned before, I will update the
concept patch further.

Regarding the scanning of SUSIs in SG, it is scanned just to know the
active and standby SU but not to handle the transition state at susi
level. After rebooting the node, existing node-failover functionality
of SG FSM will take care of things at SUSI level including si deps for
all red models. In fact, later on the patch can be evolved to call
existing SG FSM code.
[Minh] As mentioned in my previous email, I understand the concept patch is
still in progress and the issues will be fixed eventually (I would rather
call that completion). But my question was about the *value* it gives in the
end. Many healthy applications will lose availability when a node is
rebooted because of a transient SUSI in another (unimportant) application,
and the node reboot is unexpected per the configuration.
I understand that the complexity/maintainability of the AMF code is important
for maintainers, but are there any other reasons that support "immediate
escalation"? If that is the case, the concept patch seems to sacrifice
availability to gain lower complexity and better maintainability of the
code. But if we all agree that availability is most important, then
complexity/maintainability is just a matter of coding?

I think in version 1 of the patches, I had given comments about SI deps
and delayed failover getting mixed, and about the way the SI dependency is
scanned. I never got responses to those comments or to the other
comments on the v1 amfd patch. Those are important comments and need
to be addressed.
[Minh] We have received 2 emails with comments on V2 so far and all of
them have been responded to. In V4 we have corrected the patches according
to some of your comments.

Below are the dates and times at which the responses were sent:

Date: Fri, 12 Feb 2016 11:13:03 +1100
From: minh chau<minh.c...@dektech.com.au>
Subject: Re: [devel] [PATCH 1 of 5] amfd: Add README file for cloud
     resilience support [#1620] V2
To: praveen malviya<praveen.malv...@oracle.com>,
hans.nordeb...@ericsson.com,gary....@dektech.com.au,
nagendr...@oracle.com
Cc:opensaf-devel@lists.sourceforge.net


Date: Fri, 26 Feb 2016 14:41:18 +1100
From: minh chau<minh.c...@dektech.com.au>
Subject: Re: [devel] [PATCH 2 of 5] amfd: Add support for cloud
     resilience at director [#1620] V2
To: praveen malviya<praveen.malv...@oracle.com>,
hans.nordeb...@ericsson.com,gary....@dektech.com.au,
nagendr...@oracle.com
Cc:opensaf-devel@lists.sourceforge.net

Regarding the approach taken in delayed_failover() functionality, I do
not know whether it has been explored or not, but it does not use
existing SG FSM code. Using the existing code will keep it simple too.
[Minh] Which SG FSM code do you think should be used? If there is any
inappropriate code, can we all go through it and optimize it?

Thanks,
Praveen

On 03-Mar-16 1:20 PM, minh chau wrote:
Hi Nagu, Praveen,

I have been trying your patch, with the test case below:
Setup 2N model, PL4 host SU4 (act), PL5 host SU5(stb)
1. issue admin command shutdown SG
2. Hanging quiescing csi_set callback
3. Stop both SCs
4. Stop PL4
5. Restart both SCs

I have seen this error after SCs come back also:
SC-2 osafamfd[477]: ER avd_ckpt_siass:
safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon
safSi=AmfDemoTwon,safApp=AmfDemoTwon does not exist

From the trace file: after amfd sends the reboot message to PL5, realign() is
called. realign() then creates a duplicate SUSI for SU5, and this duplicate
SUSI is not checkpointed at SC-2.
When PL5 reboots, node_fail() of the 2N SG calls AVD_SU::delete_all_susis() to
delete all SUSIs of SU5. Now the 2 duplicate SUSIs are deleted and
checkpointed, and the second one causes "ER avd_ckpt_siass: ... does not
exist".
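
The sequence above can be modelled in a few lines. This is a toy model only: the Standby struct, its members and the free function delete_all_susis() are hypothetical and do not reflect the real amfd checkpoint code. It just shows why deleting two copies of a SUSI on the active side, when only one was ever checkpointed, produces one "does not exist" error on the standby.

```cpp
#include <set>
#include <string>
#include <vector>

// Standby-side view: at most one checkpointed record per SU/SI pair.
struct Standby {
  std::set<std::string> susis;
  int errors = 0;

  void ckpt_add(const std::string& key) { susis.insert(key); }

  void ckpt_delete(const std::string& key) {
    // A delete for a record that is not there models
    // "ER avd_ckpt_siass: ... does not exist".
    if (susis.erase(key) == 0) ++errors;
  }
};

// Active-side view: delete every SUSI of the SU and checkpoint each delete.
// Returns the number of checkpoint errors seen on the standby.
inline int delete_all_susis(std::vector<std::string>& active,
                            Standby& standby) {
  for (const auto& key : active) standby.ckpt_delete(key);
  active.clear();
  return standby.errors;
}
```

With one checkpointed record on the standby and two identical entries on the active side, the first delete succeeds and the second one is the error seen in the log.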

This error should also happen with lock/shutdown of SG/SU/Node/NodeGroup,
and the NodeGroup gets stuck in SHUTTING_DOWN.
I think these kinds of issues will be fixed by you eventually, but
looking through the concept patch, its complexity/maintainability
is similar to patch #6. Both have to scan through all SUs/SIs to determine
the transient SUSIs. The difference is the decision that is made: one
reboots the node, the other adjusts the state. It seems, though, that
rebooting the node will lose availability?

Thanks,
Minh


On 03/03/16 11:32, minh chau wrote:
Hi Nagu, Praveen

Patches 09 to 14 are fixes for bugs, and you also need them
on top of patch #4.
The problems you reported should not happen if you have them. They are
needed regardless of whether we *reboot node if transient states* or *adjust
transient states* (delayed failover).

Patch 09 -> Return TRY_AGAIN for pg track start/stop in headless
Patch 10 -> Resend pg information to directors after headless
Patch 11 -> There are two fixes in this patch: (11_1) Fix mapping su,
and (11_2) fix amfnd coredump given that we allow comp/su failover
(patch #5). I split them
Patch 12 -> Do not disable healthy SU
Patch 13 -> It's for one payload limitation
Patch 14 -> It's for transient state at csi level, written on top of
patch #6.

So you also need patch 09, 10, 11_1, 12, 13 on top of patch #4, and
they need to be reviewed and pushed together with #1->#4 as well.

The patch #5 #6 #7 #8 are on different view from "immediate
escalation" and "reboot node if transient states".

We will look at your assignment_recovery.patch.

I also attach patch 1620_amfd_adjust_csi_V2.diff, which is to fix the
issue in TC #27, but it also depends on conclusion of how to deal with
transient states after headless.

Thanks,
Minh

On 03/03/16 02:12, Nagendra Kumar wrote:

#1 I have applied patches #1 to #4 only. With these patches (not having
patch #6), I expected most of the following tests to pass, but
they failed (listed below).

I could not test other scenarios (including alarms and
notifications), because I haven’t applied patch #6. I think there
should be a simple patch replacing patch #6, which handles transient
state as ‘reboot the node‘ if Amf finds SUSI in transient state on
that node.

I am attaching a concept patch(assignment_recovery.patch), which pass
some of the scenarios and we are testing and enhancing it.

As Praveen has suggested, we need to reboot the node that is
in a transient state to keep it simple.

This patch reduces complexity and improves maintainability.

So, ACK for patch #1-#4 along with the attached patch.

Please note that the attached patch has been created on patch #6 of
yours, so please apply #1 to #4 and then #6 and then the attached
patch.

Currently the patch is for 2N red model. We are working to make for
Nway Act and No red model (and possibly for Nway and NpM), we will
publish it tomorrow.

TC #1:

Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover
as false) and logs attached(TC 1) in the ticket.

1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on PL-4.

2. Stop SC-1 and kill demo. It goes for comp failover as configured.
Ideally, node should reboot.

3. Start SC-1. After cluster timer expires, PL-4 got the following
error messages:

Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition
failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 :
SI=safSi=AmfDemo,safApp=AmfDemo1

Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition
failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 :
SI=safSi=AmfDemo1,safApp=AmfDemo1

There is no assignment given for SU1. SU2 has Standby assignments:

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1


saAmfSISUHAState=STANDBY(2)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1


saAmfSISUHAState=STANDBY(2)

Other problems: a.) A further command for locking SU1/SU2 fails with an SG
unstable error.

b.) immlist of SU2 gives the result below; the Standby assignment count is
printed as 4, which is wrong:

saAmfSUNumCurrStandbySIs SA_UINT32_T  4 (0x4)

saAmfSUNumCurrActiveSIs SA_UINT32_T  0 (0x0)

c.) Even if SC-2 joins and you do a failover/switchover of SC-1, the
result is still the same as above.

TC #2: After execution of TC #1, stop PL-3. In worst case, SU2
assignment should change to Act, which is not happening. After
stopping of PL-4 also, the same problems as TC #1. logs attached(TC
2).

TC #3: After TC #2, start PL-3 and start SC-2.

                SU1 is instantiated, but no assignment and the same
problem as above.

                When stop PL-4, SU1 gets assignments, the following
logs comes at SC-2:

Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass:
safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1
does not exist

Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass:
safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1
safSi=AmfDemo1,safApp=AmfDemo1 does not exist

Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link
<1.1.2:eth0-1.1.4:eth0>, peer not responding

Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link
<1.1.2:eth0-1.1.4:eth0> on network plane A

Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact
with <1.1.4>

Start PL-4, SU2 gets Standby assignments and everything works fine
after that.

TC #4: Similar problems exist in the following test cases:

a.)Configuration same as TC #1 except saAmfSutDefSUFailover as true.

                After killing demo, PL-3 went for reboot.

                But the problem is the same as shown in TC #1, TC #2
and TC #3.

b.) Configuration same as TC #1 except with
 saAmfCtDefRecoveryOnError as 2 and saAmfCtDefDisableRestart as 1.

                But the problem is the same as shown in TC #1, TC #2
and TC #3.

c.)Configuration same as TC #1 except with saAmfCtDefRecoveryOnError
as 2 and saAmfCtDefDisableRestart as 1 and saAmfSutDefSUFailover as 1.

                After killing demo, PL-3 went for reboot.

                But the problem is the same as shown in TC #1, TC #2
and TC #3.

TC #5:  Configuration same as TC #1 except with
 saAmfCtDefRecoveryOnError as 2. Configuration and logs(TC 5)
attached in ticket.

1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on PL-4.

2. Stop SC-1 and kill demo. It goes for comp restart as configured.

3. Start SC-1. After SC-1 comes up and before cluster timer expires,
stop PL-3:

Even if PL-3 is stopped(see below PL-3 is not available), SU1 is
still having Act assignment and SU2 is having Standby assignment:

PM_SC-1:/home/nagu/views/staging # amf-state siass

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1


        saAmfSISUHAState=STANDBY(2)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1


       saAmfSISUHAState=STANDBY(2)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF


        saAmfSISUHAState=ACTIVE(1)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF


        saAmfSISUHAState=ACTIVE(1)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1


        saAmfSISUHAState=ACTIVE(1)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF


        saAmfSISUHAState=ACTIVE(1)

   saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1


        saAmfSISUHAState=ACTIVE(1)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

TC #6:  After TC #5, start PL-3:

SU1 is not given any assignment (may be because it exists in Amfd db):

Mar  2 14:22:06 PM_PL-3 osafamfwd[8318]: Started

Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO
'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING
=> INSTANTIATED

Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigning
'safSi=NoRed2,safApp=OpenSAF' ACTIVE to
'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'

Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO Assigned
'safSi=NoRed2,safApp=OpenSAF' ACTIVE to
'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'

Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO
'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State
UNINSTANTIATED => INSTANTIATING

Mar  2 14:22:06 PM_PL-3 opensafd: OpenSAF(5.0.M0 -
7282:4fbffe857512:) services successfully started

Mar  2 14:22:06 PM_PL-3 amf_demo[8337]:
'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started

Mar  2 14:22:06 PM_PL-3 osafamfnd[8259]: NO
'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State
INSTANTIATING => INSTANTIATED

Mar  2 14:22:06 PM_PL-3 amf_demo[8337]: HC started with AMF

TC #7:  After TC #6:

Lock SU1: Amfnd of PL-3 throws error:

Mar  2 14:23:57 PM_PL-3 osafamfnd[8259]: ER susi_assign_evh:
'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments

This is obvious because, Amfnd doesn’t have any assignment.

SU1 admin state is locked, but SUSI is being shown on SU1.

TC #8:  After TC #7:

Lock SU1, it throws error:

Admin operation is already going on
(su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1

TC #9:  Same as TC #6 except Configure saAmfCtDefRecoveryOnError as
Node Switchover/Failover/Failfast.

The problem reported in TC #4 exists.

Thanks

-Nagu

> -----Original Message-----
> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
> Sent: 25 February 2016 14:14
> To: hans.nordeb...@ericsson.com; gary....@dektech.com.au; Nagendra
> Kumar; Praveen Malviya; minh.c...@dektech.com.au
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: [PATCH 01 of 15] amfd: Add support for cloud resilience at common
> libs [#1620]





diff --git a/osaf/libs/saf/include/saAmf.h b/osaf/libs/saf/include/saAmf.h
--- a/osaf/libs/saf/include/saAmf.h
+++ b/osaf/libs/saf/include/saAmf.h
@@ -76,7 +76,26 @@ typedef enum {
     SA_AMF_HA_ACTIVE = 1,
     SA_AMF_HA_STANDBY = 2,
     SA_AMF_HA_QUIESCED = 3,
-    SA_AMF_HA_QUIESCING = 4
+    SA_AMF_HA_QUIESCING = 4,
+    /*Temporary states written to avoid writing encode/decode utilities for quick prototyping.
+      It will be removed by using internal susi fsm state. Maybe saAmfSISUHAReadinessState of
+      SaAmfSIAssignment can be used. For the use of these states see amfnd/di.cc changes*/
+    SA_AMF_HA_11 = 11,
+    SA_AMF_HA_21 = 21,
+    SA_AMF_HA_31 = 31,
+    SA_AMF_HA_41 = 41,
+    SA_AMF_HA_12 = 12,
+    SA_AMF_HA_22 = 22,
+    SA_AMF_HA_32 = 32,
+    SA_AMF_HA_42 = 42,
+    SA_AMF_HA_13 = 13,
+    SA_AMF_HA_23 = 23,
+    SA_AMF_HA_33 = 33,
+    SA_AMF_HA_43 = 43,
+    SA_AMF_HA_14 = 14,
+    SA_AMF_HA_24 = 24,
+    SA_AMF_HA_34 = 34,
+    SA_AMF_HA_44 = 44,
 } SaAmfHAStateT;
 
 typedef enum {                                                 
diff --git a/osaf/services/saf/amf/amfd/cluster.cc 
b/osaf/services/saf/amf/amfd/cluster.cc
--- a/osaf/services/saf/amf/amfd/cluster.cc
+++ b/osaf/services/saf/amf/amfd/cluster.cc
@@ -83,6 +83,11 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB 
                        continue;
                }
 
+               /* If it is cloud resilience solution, try to resume FSM state 
of SG. */
+               if (cb->scs_absence_max_duration > 0) {
+                       i_sg->resume_sg_fsm_state(cb);
+               }
+
                if (i_sg->sg_fsm_state == AVD_SG_FSM_STABLE)
                        i_sg->realign(cb, i_sg);
        }
diff --git a/osaf/services/saf/amf/amfd/include/sg.h 
b/osaf/services/saf/amf/amfd/include/sg.h
--- a/osaf/services/saf/amf/amfd/include/sg.h
+++ b/osaf/services/saf/amf/amfd/include/sg.h
@@ -288,6 +288,14 @@ public:
        virtual void node_fail(AVD_CL_CB *cb, AVD_SU *su) = 0;
 
        /**
+        * Resume FSM state of SG as it was before controllers went down.
+        * Used when cloud resilience solution is enabled.
+        * @param cb
+        * @return
+        */
+       virtual void resume_sg_fsm_state(AVD_CL_CB *cb);
+
+       /**
         * Handle SG realign
         * Assign SIs if needed. If any assigning is gets done it adds
         * the SUs to the operation list and sets the SG FSM state to SG 
realign.
@@ -439,6 +447,7 @@ class SG_2N : public AVD_SG {
 public:
        ~SG_2N();
        void node_fail(AVD_CL_CB*, AVD_SU*);
+       void resume_sg_fsm_state(AVD_CL_CB *cb);
        uint32_t realign(AVD_CL_CB *cb, AVD_SG *sg);
        uint32_t si_assign(AVD_CL_CB *cb, AVD_SI *si);
        uint32_t si_admin_down(AVD_CL_CB *cb, AVD_SI *si);
diff --git a/osaf/services/saf/amf/amfd/sg.cc b/osaf/services/saf/amf/amfd/sg.cc
--- a/osaf/services/saf/amf/amfd/sg.cc
+++ b/osaf/services/saf/amf/amfd/sg.cc
@@ -2027,3 +2027,8 @@ uint32_t AVD_SG::curr_non_instantiated_s
                [](AVD_SU *su) -> bool { return ((su->list_of_susi == nullptr) 
&&
                        (su->saAmfSUPresenceState == 
SA_AMF_PRESENCE_UNINSTANTIATED));}));      
 }
+// default implementation
+void AVD_SG::resume_sg_fsm_state(AVD_CL_CB *cb) {
+       return;
+}
+
diff --git a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc 
b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
--- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
+++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
@@ -3533,6 +3533,181 @@ void SG_2N::node_fail(AVD_CL_CB *cb, AVD
 done:
        TRACE_LEAVE();
 }
+//Function resumes SG FSM state, after first controller comes up in cloud 
resilience solution.
+void SG_2N::resume_sg_fsm_state(AVD_CL_CB *cb) {
+       TRACE_ENTER();
+       AVD_SU *su_on_op = nullptr;
+       AVD_AVND *node_on_op = nullptr;
+       AVD_SI *si_on_op = nullptr;
+       SaAmfHAStateT su_ha_state;      
+
+       //Step 1: Know the admin operation entity and about si deps within su.
+
+       //Check if admin operation was going on su or node.
+       for (const auto& su : list_of_su) {
+               if (su->list_of_susi == NULL)
+                       continue;
+               if (su->saAmfSUAdminState != SA_AMF_ADMIN_UNLOCKED) {
+                       su_on_op = su;
+                       LOG_NO("Operation was on su:%s",su->name.value);
+               }
+               if (su->su_on_node->saAmfNodeAdminState != 
SA_AMF_ADMIN_UNLOCKED) {
+                       node_on_op = su->su_on_node;
+                       LOG_NO("Operation was on 
node:%s",su->su_on_node->name.value);
+               }
+       }
+       //Check if admin operation was going on si.
+       for (const auto& si : list_of_si) {
+               if ((si->list_of_sisu) && (si->saAmfSIAdminState != 
SA_AMF_ADMIN_UNLOCKED)) {
+                       si_on_op = si;
+                       LOG_NO("Operation was on si:%s",si->name.value);
+                       break;
+               }
+       }
+       //NG to be considered later.
+       //Check if SI deps is configured within SU. 
+       bool si_dep_exist = false;
+       for (const auto& su : list_of_su) {
+               if ((su->list_of_susi) && 
avd_sidep_si_dependency_exists_within_su(su)) {
+                       si_dep_exist = true;
+                       LOG_NO("si deps exists within su in sg:%s",name.value);
+               }
+       }
+
+       /*Step 2: Regain the fsm state of each SUSI. This was properly set while recreating SUSIs
+               in avd_susi_recreate() from AMFND. AMFND message about SUSI recreation carries
+               assignment status of SI also. */
+
+       //Resume SG FSM state when SU lock/shutdown operation was going on.
+       if (su_on_op && (su_on_op->saAmfSUOperState == 
SA_AMF_OPERATIONAL_ENABLED)) {
+
+               /*Since SUSI FSM states are also properly set now, avd_su_state_determine()
+                 will deduce SU state correctly.
+                */
+               su_ha_state = avd_su_state_determine(su_on_op);
+               if (si_dep_exist) {
+                       //TODO: case when SI dep is configured.
+               } else {
+
+                       //No SI Deps means all SUSI will have same fsm states.
+                       if ((su_ha_state == SA_AMF_HA_QUIESCING) || 
(su_ha_state == SA_AMF_HA_QUIESCED)) {
+                               if (su_on_op->any_susi_fsm_in_modify() == true) 
{
+                                       /*This means assignment is pending from 
amfnd or component
+                                         has not responded till now for 
quiesced/quiescing states.
+                                         AMFND assignment response will 
trigger SG FSM, so just set
+                                         FSM state.
+                                        */
+                                       avd_sg_su_oper_list_add(cb, su_on_op, 
false);
+                                       m_AVD_SET_SG_FSM(cb, (this), 
AVD_SG_FSM_SU_OPER);
+                               } else if (su_on_op->any_susi_fsm_in_unasgn() 
== true) {
+                                       /*This means assignment is pending from 
amfnd or component
+                                         has not responded till now for 
removal of assignments.
+                                         AMFND assignment response will 
trigger SG FSM, so just set
+                                         FSM state.*/
+                                       avd_sg_su_oper_list_add(cb, su_on_op, 
false);
+                                       m_AVD_SET_SG_FSM(cb, (this), 
AVD_SG_FSM_SG_REALIGN);
+                               } else if (all_quiesced(su_on_op) == true) {
+                                       /*This means assignment is not pending 
from amfnd and component
+                                       has responded for quiesced state.*/
+                                       //TODO: Trigger SG FSM now because 
AMFND will not send any event
+                                       // to trigger SG FSM.
+                               }
+                       } else if (su_ha_state == SA_AMF_HA_STANDBY) {
+                                       avd_sg_su_oper_list_add(cb, su_on_op, 
false);
+                                       m_AVD_SET_SG_FSM(cb, (this), 
AVD_SG_FSM_SG_REALIGN);
+                       } else if (su_ha_state == SA_AMF_HA_ACTIVE) {
+                               /*TODO:This means admin operation was about to start when controllers went down.
+                                 There are choices here a)revert the admin state and ask operator to
+                                 reissue the admin operation or b) call su_admin_down() to continue
+                                 the operation.*/
+                       }
+               }
+       }
+       //Resume SG FSM state in SI lock/shutdown.
+       if (si_on_op) {
+               AVD_SU *su_standby = NULL, *su_active = NULL, *su_quiesce = 
NULL;
+               for (AVD_SU_SI_REL *curr_susi = si_on_op->list_of_sisu; 
curr_susi;
+                               curr_susi = curr_susi->si_next) {
+                       if (curr_susi->state == SA_AMF_HA_ACTIVE)
+                               su_active = curr_susi->su;
+                       if (curr_susi->state == SA_AMF_HA_STANDBY)
+                               su_standby = curr_susi->su;
+                       if (curr_susi->state == SA_AMF_HA_QUIESCED || 
curr_susi->state == SA_AMF_HA_QUIESCING)
+                               su_quiesce = curr_susi->su;
+               }
+                if (si_dep_exist) {
+                } else {
+                       //No SI Deps means all SUSI will have same fsm states.
+                        if (su_quiesce && ((su_quiesce->list_of_susi->state == 
SA_AMF_HA_QUIESCING) ||
+                                        (su_quiesce->list_of_susi->state == 
SA_AMF_HA_QUIESCED))) {
+                                if (su_quiesce->any_susi_fsm_in_modify() == 
true) {
+                                       /*This means assignment is pending from 
amfnd or component
+                                         has not responded till now for 
quiesced/quiescing states.
+                                         AMFND assignment response will 
trigger SG FSM, so just set
+                                         FSM state.
+                                        */
+                                       m_AVD_SET_SG_ADMIN_SI(cb, si_on_op);
+                                        m_AVD_SET_SG_FSM(cb, (this), 
AVD_SG_FSM_SI_OPER);
+                                } else if 
(su_quiesce->any_susi_fsm_in_unasgn() == true) {
+                                       /*This means assignment is pending from 
amfnd or component
+                                         has not responded till now for 
removal of assignments.
+                                         AMFND assignment response will 
trigger SG FSM, so just set
+                                         FSM state.*/
+                                        avd_sg_su_oper_list_add(cb, su_quiesce, false);
+                                        m_AVD_SET_SG_FSM(cb, (this), 
AVD_SG_FSM_SG_REALIGN);
+                                } else if (all_quiesced(su_quiesce) == true) {
+                                       /*This means assignment is not pending 
from amfnd and component
+                                       has responded for quiesced state.*/
+                                       //TODO: Trigger SG FSM now because 
AMFND will not send any event
+                                       // to trigger SG FSM.
+                               }
+                        }  else if (su_standby) {
+                               //TODO
+                        }  else if (su_active) {
+                               //TODO
+                       }
+                }
+        }
+       //Resume SG FSM state SG lock/shutdown.
+       if (saAmfSGAdminState != SA_AMF_ADMIN_UNLOCKED) {
+               AVD_SU *act = NULL, *std = NULL, *su_in_transition = NULL;
+               for (const auto& su : list_of_su) {
+                       SaAmfHAStateT su_ha_state;
+                       if (su->list_of_susi) {
+                               su_ha_state = avd_su_state_determine(su);
+                               if (su_ha_state == SA_AMF_HA_QUIESCED ||
+                                               su_ha_state == 
SA_AMF_HA_QUIESCING) {
+                                       su_in_transition = su;
+                               } else if (su_ha_state == SA_AMF_HA_ACTIVE) {
+                                       act = su;
+                               } else if (su_ha_state == SA_AMF_HA_STANDBY) {
+                                       std = su;
+                               }
+                       }
+               }
+               if (su_in_transition && 
(su_in_transition->any_susi_fsm_in_modify() == true)) {
+                       /*This means assignment is pending from amfnd or 
component
+                         has not responded till now for quiesced/quiescing 
states.
+                         AMFND assignment response will trigger SG FSM, so 
just set
+                         FSM state.
+                        */
+                       m_AVD_SET_SG_FSM(cb, (this), AVD_SG_FSM_SG_ADMIN);
+               } else if (std) {
+                       //TODO
+               } else if (act) {
+                       //TODO
+               }
+       }
+       //Resume SG FSM state in node lock/shutdown.
+       if (node_on_op) {
+               //TODO.
+       }
+       //Resume SG FSM state in ng lock/shutdown.
+       //Resume SG FSM state in Faults su-failover/comp-failover/node-switchover/node-failover.
+       //Resume SG FSM state in Admin operation mix with faults su/comp/node failover/switchover.
+
+       TRACE_LEAVE();
+}
 
 uint32_t SG_2N::su_admin_down(AVD_CL_CB *cb, AVD_SU *su, AVD_AVND *avnd) {
        uint32_t rc = NCSCC_RC_FAILURE;
diff --git a/osaf/services/saf/amf/amfd/siass.cc 
b/osaf/services/saf/amf/amfd/siass.cc
--- a/osaf/services/saf/amf/amfd/siass.cc
+++ b/osaf/services/saf/amf/amfd/siass.cc
@@ -865,6 +865,7 @@ SaAisErrorT avd_susi_recreate(AVSV_N2D_N
 
        for (susi_state = info->sisu_list; susi_state != nullptr;
                        susi_state = susi_state->next) {
+               AVD_SU_SI_STATE fsm_state = AVD_SU_SI_STATE_ABSENT;
 
                assert(susi_state->safSI.length > 0);
                AVD_SI *si = si_db->find(Amf::to_string(&susi_state->safSI));
@@ -874,7 +875,52 @@ SaAisErrorT avd_susi_recreate(AVSV_N2D_N
                osafassert(su);
 
                SaAmfHAStateT ha_state = susi_state->saAmfSISUHAState;
+               /*TODO: use si assignment state variable used by AMFND.
+                 All the if else will get removed then. As of now decode SI assignment state
+                 information from newly introduced HA states.
+                 Details of SA_AMF_HA_{11-44} are in amfnd/di.cc changes.*/
 
+               if (susi_state->saAmfSISUHAState == SA_AMF_HA_11) {
+                       ha_state = SA_AMF_HA_ACTIVE; 
+                       fsm_state = AVD_SU_SI_STATE_MODIFY; 
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_21) {
+                       ha_state = SA_AMF_HA_ACTIVE; 
+                       fsm_state = AVD_SU_SI_STATE_UNASGN; 
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_31) {
+                       ha_state = SA_AMF_HA_ACTIVE; 
+                       fsm_state = AVD_SU_SI_STATE_ASGND; 
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_41) {
+                       ha_state = SA_AMF_HA_ACTIVE; 
+                       fsm_state = AVD_SU_SI_STATE_UNASGN; 
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_12) {
+                       ha_state = SA_AMF_HA_STANDBY;
+                       fsm_state = AVD_SU_SI_STATE_MODIFY;
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_22) {
+                       ha_state = SA_AMF_HA_STANDBY;
+                       fsm_state = AVD_SU_SI_STATE_UNASGN;
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_32) {
+                       ha_state = SA_AMF_HA_STANDBY;
+                       fsm_state = AVD_SU_SI_STATE_ASGND;
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_42) {
+                       ha_state = SA_AMF_HA_STANDBY;
+                       fsm_state = AVD_SU_SI_STATE_UNASGN;
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_13) {
+                       ha_state = SA_AMF_HA_QUIESCED; 
+                       fsm_state = AVD_SU_SI_STATE_MODIFY; 
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_33) {
+                       ha_state = SA_AMF_HA_QUIESCED; 
+                       fsm_state = AVD_SU_SI_STATE_ASGND; 
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_14) {
+                       ha_state = SA_AMF_HA_QUIESCING;
+                       fsm_state = AVD_SU_SI_STATE_MODIFY;
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_34) {
+                       ha_state = SA_AMF_HA_QUIESCING;
+                       fsm_state = AVD_SU_SI_STATE_ASGND;
+               } else if (susi_state->saAmfSISUHAState == SA_AMF_HA_44) {
+                       ha_state = SA_AMF_HA_QUIESCING; 
+                       fsm_state = AVD_SU_SI_STATE_UNASGN; 
+               }
+       
                susi = avd_su_susi_find(avd_cb, su, &susi_state->safSI);
                if (susi == nullptr) {
                        susi = avd_susi_create(avd_cb, si, su, ha_state, false);
@@ -882,7 +928,10 @@ SaAisErrorT avd_susi_recreate(AVSV_N2D_N
                } else {
                        avd_susi_ha_state_set(susi, ha_state);
                }
-               susi->fsm = AVD_SU_SI_STATE_ASGND;
+
+               susi->fsm = fsm_state;
+               LOG_NO("***********Recovered state of SUSI si:%s, su:%s, HA:%u, fsm:%u",
+                               susi->si->name.value, susi->su->name.value, susi->state, susi->fsm);
 
                if (susi->state == SA_AMF_HA_QUIESCING) {
                        susi->su->inc_curr_act_si();
diff --git a/osaf/services/saf/amf/amfnd/di.cc b/osaf/services/saf/amf/amfnd/di.cc
--- a/osaf/services/saf/amf/amfnd/di.cc
+++ b/osaf/services/saf/amf/amfnd/di.cc
@@ -1588,6 +1588,49 @@ void avnd_sync_sisu(AVND_CB *cb)
                        si_assignment.si = si->name;
                        si_assignment.saAmfSISUHAState = si->curr_state;
 
+                       /*TODO: use saAmfSISUHAReadinessState or introduce new variable to send
+                         assignment status of SI.
+                         As of now merge SI assignment state information with HA state using new HA state.
+                         Here are the tmp rules:
+                               ASSIGNING = SA_AMF_HA_${10 + actual HA state}.
+                               REMOVING  = SA_AMF_HA_${20 + actual HA state}.
+                               ASSIGNED  = SA_AMF_HA_${30 + actual HA state}.
+                               REMOVED   = SA_AMF_HA_${40 + actual HA state}.
+                         e.g. quiesced SI in ASSIGNING state = SA_AMF_HA_{10 + 3} = SA_AMF_HA_13.
+                         AMFD will decode HA state and assignment state in the same way.*/
+                       if (si->curr_state == SA_AMF_HA_QUIESCED && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_13;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCED && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_23;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCED && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_33;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCED && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_43;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCING && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_14;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCING && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_24;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCING && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_34;
+                       } else if (si->curr_state == SA_AMF_HA_QUIESCING && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_44;
+                       } else if (si->curr_state == SA_AMF_HA_ACTIVE && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_11;
+                       } else if (si->curr_state == SA_AMF_HA_ACTIVE && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_21;
+                       } else if (si->curr_state == SA_AMF_HA_ACTIVE && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_31;
+                       } else if (si->curr_state == SA_AMF_HA_ACTIVE && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_41;
+                       } else if (si->curr_state == SA_AMF_HA_STANDBY && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_12;
+                       } else if (si->curr_state == SA_AMF_HA_STANDBY && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVING(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_22;
+                       } else if (si->curr_state == SA_AMF_HA_STANDBY && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_32;
+                       } else if (si->curr_state == SA_AMF_HA_STANDBY && m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVED(si)) {
+                               si_assignment.saAmfSISUHAState = SA_AMF_HA_42;
+                       }
                        add_sisu_state_info(&msg, &si_assignment);
                }
 
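Note on the temporary SA_AMF_HA_{11-44} encoding: the rule in the di.cc comment (tens digit = assignment state, units digit = actual HA state) could also be expressed arithmetically, so that the AMFND encoder and the AMFD decoder cannot drift apart the way two hand-written if/else chains can. The following is only a minimal standalone sketch, not part of the patch. The enum types HaState, AssignState and SusiFsm and the functions encode(), decode() and round_trip_ok() are hypothetical stand-ins for illustration; only the numeric HA values (ACTIVE=1, STANDBY=2, QUIESCED=3, QUIESCING=4) and the assignment-state-to-FSM mapping are taken from the rules and branches shown in the patch.

```cpp
#include <cassert>
#include <optional>

// Numeric values follow the "10 + actual HA state" rule in the patch comment.
enum class HaState { Active = 1, Standby = 2, Quiesced = 3, Quiescing = 4 };

// Hypothetical stand-in for the AMFND per-SI assignment state (tens digit).
enum class AssignState { Assigning = 1, Removing = 2, Assigned = 3, Removed = 4 };

// Hypothetical stand-in for the AMFD SUSI FSM states used in the patch.
enum class SusiFsm { Absent, Modify, Unasgn, Asgnd };

struct Decoded {
  HaState ha;
  AssignState assign;
  SusiFsm fsm;
};

// AMFND side: code = 10 * assignment state + HA state,
// e.g. ASSIGNING + QUIESCED = 10 + 3 = 13.
constexpr int encode(AssignState a, HaState h) {
  return 10 * static_cast<int>(a) + static_cast<int>(h);
}

// AMFD side: split the code back into its two digits and map the
// assignment state to a SUSI FSM state the same way the patch does
// (ASSIGNING -> MODIFY, ASSIGNED -> ASGND, REMOVING/REMOVED -> UNASGN).
// Returns std::nullopt for anything outside 11..44 with digits 1..4,
// so a plain HA state (1..4) is left for the caller to handle as before.
inline std::optional<Decoded> decode(int code) {
  const int a = code / 10;
  const int h = code % 10;
  if (a < 1 || a > 4 || h < 1 || h > 4) return std::nullopt;

  SusiFsm fsm = SusiFsm::Absent;
  switch (static_cast<AssignState>(a)) {
    case AssignState::Assigning: fsm = SusiFsm::Modify; break;
    case AssignState::Assigned:  fsm = SusiFsm::Asgnd;  break;
    case AssignState::Removing:
    case AssignState::Removed:   fsm = SusiFsm::Unasgn; break;
  }
  return Decoded{static_cast<HaState>(h), static_cast<AssignState>(a), fsm};
}

// Checks that all 16 combinations survive an encode/decode round trip.
inline bool round_trip_ok() {
  for (int a = 1; a <= 4; ++a) {
    for (int h = 1; h <= 4; ++h) {
      const auto d = decode(encode(static_cast<AssignState>(a),
                                   static_cast<HaState>(h)));
      if (!d || static_cast<int>(d->assign) != a ||
          static_cast<int>(d->ha) != h) {
        return false;
      }
    }
  }
  return true;
}
```

With a shape like this, every one of the 16 codes the node director can send has a decode path by construction. Whether the real fix should go this way or use a separate field, as the TODO suggests, is left open.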