Hi Nagu,

The attached patch is for TC #41 and #42.
We have noticed one bug in sidep and will send an update soon.
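
For reference, here is a rough standalone sketch of the per-SU decision that the attached patch moves into AVD_SG::adjust_intermediate_sg(). The enum and function names below are hypothetical stand-ins, not the real amfd types. It only mirrors the branch structure visible in the diff: an SU with assignments that is LOCKED at SU, SG or node level gets its assignments removed unless it is ACTIVE, in which case it is first quiesced.

```cpp
// Hypothetical stand-ins for the AMF enums; only the values used here.
enum class HaState { Active, Standby, Quiesced, Quiescing };
enum class AdminState { Unlocked, Locked, LockedInstantiation, ShuttingDown };
enum class Action { None, RemoveAssignments, Quiesce };

// An SU is affected if the SU itself, its SG, or its hosting node is LOCKED.
bool locked_at_any_level(AdminState su, AdminState sg, AdminState node) {
  return su == AdminState::Locked || sg == AdminState::Locked ||
         node == AdminState::Locked;
}

// Mirrors the if/else-if in the diff: ACTIVE is quiesced first,
// STANDBY/QUIESCED/QUIESCING assignments are removed directly.
Action action_for_su(bool locked, bool has_assignments, HaState ha) {
  if (!locked || !has_assignments) return Action::None;
  switch (ha) {
    case HaState::Active:
      return Action::Quiesce;
    case HaState::Standby:
    case HaState::Quiesced:
    case HaState::Quiescing:
      return Action::RemoveAssignments;
  }
  return Action::None;
}
```

In the real code, either action is also followed by adding the SU to the SG operation list and moving the SG to SG_REALIGN, which this sketch leaves out.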

Thanks,
Minh

On 22/02/16 23:48, Hans Nordebäck wrote:
Hi,

Please see the enclosed patch for TC #1, #6, #7, #8, #9, #10 and #11.

Thanks,
HansN

On 02/19/2016 10:09 AM, minh chau wrote:
Hi Nagu,

Thanks for your testing.
Below is our investigation of TC #1 to TC #31, which seem to be the important ones, plus some patches with which we are trying to fix the issues.

1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
Discussion is ongoing. Hitting the limitation causes a mismatch between the objects in amfd/IMM and those in amfnd. In that case amfd tries to tolerate the mismatch or, in the worst case, orders a node reboot of the last payload in the cluster to avoid a cyclic amfd crash.

2. Suspicious setting SU oper state in amfnd (TC #13, #24)
According to the trace in TC #24, after unlock-in of the nodegroup, amfnd changes the SU oper state to DISABLED, which is wrong since no fault has happened on the SU. We can raise a ticket on the non-headless code base, though we think the patch 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24 for now.
TC #13 has a similar problem, but the provided trace is only for amfd, so we don't know where amfnd changed the SU oper state to DISABLED. We haven't been able to reproduce the problem in TC #13 so far.

3. Problem in TC #16
It seems the fault lies in the base code when the system is not headless. The admin state should not stay in unlocked state. We can raise a ticket on the current non-headless code later.

4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
We have a patch for this; please apply fix.patch

5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14, #17, #19, #20, #21)
We have a patch for this; please apply fix.patch

6. Support Nodegroup handling in delayed failover (TC #23)
At the time we developed AMF cloud resilience, nodegroup support had not been pushed yet, so we missed it.
Please apply the patch 1620_amfd_adjust_interm_admin_state.diff

7. Problem in TC #25
We think it's not really a fault. Please see our opinion at the end of email

8. Delayed failover needs to check csi level (TC #27)
The fault is reproducible, but it seems to be a rare use case in which the user creates an extra csi just before deciding to go headless. A fix is in progress.

9. Recovery of a non-existent csi (TC #28)
The CSI had been deleted in IMM, but because of a delay at the application its assignment object is still in amfnd at recovery time. The patch simply skips re-creating this non-existent csi; please apply 1620_amfd_ignore_nonexisted_csi.diff

10. Delayed si dep issue (TC #29, #30)
We have a patch for this; please apply 1620_amfd_add_su_op_list_delayed_sidep.diff

11. About TC #31: is the test case itself at fault?
The trace shows that A sponsors C, but the test locks B and expects C to be removed:
Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)
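
To illustrate item 9 above (ignoring a csi that no longer exists), here is a minimal standalone sketch. It is not the content of 1620_amfd_ignore_nonexisted_csi.diff; the struct, the function name and the use of plain DN strings are assumptions made only for illustration. The idea is that, when amfd rebuilds assignments from what amfnd reports after headless, any reported csi that is no longer in the configuration is skipped instead of being re-created.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one CSI assignment reported by amfnd at recovery.
struct ReportedCsiAssignment {
  std::string su_dn;
  std::string csi_dn;
};

// Keep only the assignments whose CSI still exists in the configuration.
std::vector<ReportedCsiAssignment> recoverable_assignments(
    const std::vector<ReportedCsiAssignment>& reported,
    const std::set<std::string>& configured_csi_dns) {
  std::vector<ReportedCsiAssignment> kept;
  for (const auto& assignment : reported) {
    // CSI was deleted while headless: ignore it rather than re-create it.
    if (configured_csi_dns.count(assignment.csi_dn) == 0) continue;
    kept.push_back(assignment);
  }
  return kept;
}
```

With two reported assignments and only one of their csi DNs still configured, the function returns just the one that can be recovered.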

Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13 and #14 should be fixed by the attached patches)

TC #32 to TC #40 were run on the Npm and Nway models, which we have not planned to support since we have not seen headless users using those models, so they are expected to be buggy. We added this limitation to the README; support could be an enhancement in the future.

So we're still working on the remaining TCs (#2, #3, #4, #15, #27, #41, #42), on alarm/notification related issues, and on testing the patches. If you find any other problems (or if you have time to help reproduce TC #2, #12, #13, #14), traces are very helpful, though it would be even nicer if we could have your test model, so that we can quickly see which attributes are on or off.

Thanks,
Minh


Opinion on TC#25
------------------------
TC25:
- At the time SC1 restarts, amfd adjusts the assignment. amfd decides to remove the QUIESCED assignment of safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, because gdb is still holding the quiesced csi_set_callback

Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- amfnd on PL3 receives the su_si_del msg and buffers it
Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >> avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >> avnd_su_siq_rec_add: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- gdb releases csi_set_callback, the quiesced assignment sequence continues, finishes, and is reported to amfd
Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >> avnd_di_susi_resp_send: Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1

- Then amfnd pulls out the buffered su_si_del and continues with the assignment removal sequence. This sequence finishes and amfnd reports to amfd
Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending. msg_id'3', node_id'131855', msg_act'4', su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', si'', ha_state'3', error'1', single_csi'0'

- At amfd, upon receiving the report of quiesced assignment completion, amfd decides to remove the quiesced assignment of SU1
Feb 11 19:46:27.086796 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- So amfnd on PL3 receives an extra su_si_del msg and logs an error
Feb 11 19:46:27.330333 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:27.330359 osafamfnd [16881:su.cc:0425] ER susi_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
Feb 11 19:46:27.330369 osafamfnd [16881:su.cc:0447] << avnd_evt_avd_info_su_si_assign_evh: 1

This is not a real problem; we think we can change the log to a warning.
The improvement could be made on the amfnd side in the non-headless code base, so that it can flexibly handle an unexpected extra su_si event. #1386 found this kind of limitation in amfnd, where it could not handle another su_si event as expected because multiple failures happened along with an admin command.
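
A minimal sketch of what "handle the extra su_si_del flexibly" could look like, assuming a simplified model in which amfnd only tracks how many SI assignments each SU holds. This is not the real amfnd handler; the table type, the result struct and handle_su_si_del() are hypothetical names. A delete for an SU with no assignments is treated as already done and logged at warning level instead of error.

```cpp
#include <map>
#include <string>

enum class LogLevel { None, Warning };

struct DelResult {
  bool handled;    // true if the request can be answered as successful
  LogLevel level;  // severity of the log entry this request would produce
};

// Hypothetical per-node table: SU DN -> number of SI assignments held.
using AssignmentTable = std::map<std::string, int>;

DelResult handle_su_si_del(AssignmentTable& table, const std::string& su_dn) {
  auto it = table.find(su_dn);
  if (it == table.end() || it->second == 0) {
    // Duplicate or late delete: nothing left to remove, so only warn.
    return {true, LogLevel::Warning};
  }
  table.erase(it);  // normal case: remove all assignments of this SU
  return {true, LogLevel::None};
}
```

In the TC #25 sequence the first su_si_del (the buffered one) would take the normal path, and the second one sent by amfd would take the warning path.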









diff --git a/osaf/services/saf/amf/amfd/cluster.cc b/osaf/services/saf/amf/amfd/cluster.cc
--- a/osaf/services/saf/amf/amfd/cluster.cc
+++ b/osaf/services/saf/amf/amfd/cluster.cc
@@ -100,7 +100,7 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB 
 		 * to satisfy the number of assignment configuration.
 		 */
 		if (cb->scs_absence_max_duration > 0) {
-			i_sg->adjust_intermediate_sg();
+			i_sg->adjust_intermediate_sg(cb);
 			i_sg->adjust_delayed_failover(cb);
 		}
 		if (i_sg->sg_fsm_state == AVD_SG_FSM_STABLE)
diff --git a/osaf/services/saf/amf/amfd/include/sg.h b/osaf/services/saf/amf/amfd/include/sg.h
--- a/osaf/services/saf/amf/amfd/include/sg.h
+++ b/osaf/services/saf/amf/amfd/include/sg.h
@@ -424,7 +424,7 @@ public:
 	bool is_sg_serviceable_outside_ng(const AVD_AMF_NG *ng);
 	SaAisErrorT check_sg_stability();
 	bool ng_using_saAmfSGAdminState;
-	void adjust_intermediate_sg();
+	void adjust_intermediate_sg(AVD_CL_CB *cb);
 	uint32_t term_su_list_in_reverse();
        //Runtime calculates value of saAmfSGNumCurrAssignedSUs;
 	uint32_t curr_assigned_sus() const;
diff --git a/osaf/services/saf/amf/amfd/sg.cc b/osaf/services/saf/amf/amfd/sg.cc
--- a/osaf/services/saf/amf/amfd/sg.cc
+++ b/osaf/services/saf/amf/amfd/sg.cc
@@ -2087,8 +2087,9 @@ uint32_t AVD_SG::curr_non_instantiated_s
 			(su->saAmfSUPresenceState == SA_AMF_PRESENCE_UNINSTANTIATED));}));	
 }
 
-void AVD_SG::adjust_intermediate_sg()
+void AVD_SG::adjust_intermediate_sg(AVD_CL_CB *cb)
 {
+	AVD_SU_SI_REL *curr_susi;
 	TRACE_ENTER();
 
 	// move SG from SHUTTING_DOWN to LOCKED
@@ -2104,5 +2105,78 @@ void AVD_SG::adjust_intermediate_sg()
 		if (si->saAmfSIAdminState == SA_AMF_ADMIN_SHUTTING_DOWN)
 			si->set_admin_state(SA_AMF_ADMIN_LOCKED);
 	}
+
+	// Check AdminState of node/sg/su whether is LOCKED
+	// which states will cause removal of assignment
+	for (const auto& su : list_of_su) {
+		SaAmfHAStateT su_ha_state;
+		TRACE("Check AdminState of SU/SG/Node, SU:'%s', saAmfSUAdminState:%u, "
+				"saAmfSGAdminState:%u, saAmfNodeAdminState:%u, "
+				"saAmfSUNumCurrActiveSIs:%u, saAmfSUNumCurrStandbySIs:%u",
+				su->name.value,
+				su->saAmfSUAdminState,
+				su->sg_of_su->saAmfSGAdminState,
+				su->su_on_node->saAmfNodeAdminState,
+				su->saAmfSUNumCurrActiveSIs,
+				su->saAmfSUNumCurrStandbySIs);
+
+		if (su->saAmfSUAdminState == SA_AMF_ADMIN_LOCKED ||
+			su->sg_of_su->saAmfSGAdminState == SA_AMF_ADMIN_LOCKED ||
+			su->su_on_node->saAmfNodeAdminState == SA_AMF_ADMIN_LOCKED) {
+
+			if (su->list_of_susi) {
+				su_ha_state = avd_su_state_determine(su);
+				if (su_ha_state == SA_AMF_HA_QUIESCED ||
+					su_ha_state == SA_AMF_HA_STANDBY ||
+					su_ha_state == SA_AMF_HA_QUIESCING) {
+					// remove all susi belong to this su
+					avd_sg_su_si_del_snd(cb, su);
+				} else if (su_ha_state == SA_AMF_HA_ACTIVE) {
+					// quiesced this su
+					avd_sg_su_si_mod_snd(cb, su, SA_AMF_HA_QUIESCED);
+				}
+				avd_sg_su_oper_list_add(cb, su, false);
+				set_fsm_state(AVD_SG_FSM_SG_REALIGN);
+			}
+		}
+	}
+
+	// Check AdminState of si whether is LOCKED
+	for (const auto& si : list_of_si) {
+
+		TRACE("Check SI:'%s', saAmfSIAdminState:%u, saAmfSINumCurrActiveAssignments:%u, "
+				"saAmfSINumCurrStandbyAssignments:%u",
+				si->name.value,
+				si->saAmfSIAdminState,
+				si->saAmfSINumCurrActiveAssignments,
+				si->saAmfSINumCurrStandbyAssignments);
+
+		for (curr_susi = si->list_of_sisu; curr_susi;
+				curr_susi = curr_susi->si_next) {
+
+			TRACE("Check SUSI:'%s,%s', HaState:%u", curr_susi->su->name.value,
+					curr_susi->si->name.value,
+					curr_susi->state);
+
+			if (si->saAmfSIAdminState == SA_AMF_ADMIN_LOCKED) {
+				// only process assigned susi, ignore the others due to
+				// being modified or unassigned, ...
+				if (curr_susi->fsm == AVD_SU_SI_STATE_ASGND) {
+					if (curr_susi->state == SA_AMF_HA_STANDBY ||
+						curr_susi->state == SA_AMF_HA_QUIESCED ||
+						curr_susi->state == SA_AMF_HA_QUIESCING) {
+						// remove one susi
+						avd_susi_del_send(curr_susi);
+					} else if (curr_susi->state == SA_AMF_HA_ACTIVE) {
+						// quiesced one susi
+						avd_susi_mod_send(curr_susi, SA_AMF_HA_QUIESCED);
+					}
+					set_fsm_state(AVD_SG_FSM_SG_REALIGN);
+					avd_sg_su_oper_list_add(cb, curr_susi->su, false);
+				}
+			}
+		}
+	}
+
 	TRACE_LEAVE();
 }
diff --git a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
--- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
+++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
@@ -3538,41 +3538,6 @@ void SG_2N::adjust_delayed_failover(AVD_
 	AVD_SU_SI_REL *curr_susi;
 	TRACE_ENTER();
 
-	// Check AdminState of node/sg/su whether is LOCKED
-	// which states will cause removal of assignment
-	for (const auto& su : list_of_su) {
-		SaAmfHAStateT su_ha_state;
-		TRACE("Check AdminState of SU/SG/Node, SU:'%s', saAmfSUAdminState:%u, "
-				"saAmfSGAdminState:%u, saAmfNodeAdminState:%u, "
-				"saAmfSUNumCurrActiveSIs:%u, saAmfSUNumCurrStandbySIs:%u",
-				su->name.value,
-				su->saAmfSUAdminState,
-				su->sg_of_su->saAmfSGAdminState,
-				su->su_on_node->saAmfNodeAdminState,
-				su->saAmfSUNumCurrActiveSIs,
-				su->saAmfSUNumCurrStandbySIs);
-
-		if (su->saAmfSUAdminState == SA_AMF_ADMIN_LOCKED ||
-			su->sg_of_su->saAmfSGAdminState == SA_AMF_ADMIN_LOCKED ||
-			su->su_on_node->saAmfNodeAdminState == SA_AMF_ADMIN_LOCKED) {
-
-			if (su->list_of_susi) {
-				su_ha_state = avd_su_state_determine(su);
-				if (su_ha_state == SA_AMF_HA_QUIESCED ||
-					su_ha_state == SA_AMF_HA_STANDBY ||
-					su_ha_state == SA_AMF_HA_QUIESCING) {
-					// remove all susi belong to this su
-					avd_sg_su_si_del_snd(cb, su);
-				} else if (su_ha_state == SA_AMF_HA_ACTIVE) {
-					// quiesced this su
-					avd_sg_su_si_mod_snd(cb, su, SA_AMF_HA_QUIESCED);
-				}
-				avd_sg_su_oper_list_add(cb, su, false);
-				set_fsm_state(AVD_SG_FSM_SG_REALIGN);
-			}
-		}
-	}
-
 	// Check AdminState of si whether is LOCKED
 	for (const auto& si : list_of_si) {
 
@@ -3594,23 +3559,6 @@ void SG_2N::adjust_delayed_failover(AVD_
 					curr_susi->si->name.value,
 					curr_susi->state);
 
-			if (si->saAmfSIAdminState == SA_AMF_ADMIN_LOCKED) {
-				// only process assigned susi, ignore the others due to
-				// being modified or unassigned, ...
-				if (curr_susi->fsm == AVD_SU_SI_STATE_ASGND) {
-					if (curr_susi->state == SA_AMF_HA_STANDBY ||
-						curr_susi->state == SA_AMF_HA_QUIESCED ||
-						curr_susi->state == SA_AMF_HA_QUIESCING) {
-						// remove one susi
-						avd_susi_del_send(curr_susi);
-					} else if (curr_susi->state == SA_AMF_HA_ACTIVE) {
-						// quiesced one susi
-						avd_susi_mod_send(curr_susi, SA_AMF_HA_QUIESCED);
-					}
-					set_fsm_state(AVD_SG_FSM_SG_REALIGN);
-					avd_sg_su_oper_list_add(cb, curr_susi->su, false);
-				}
-			}
 			if (curr_susi->fsm != AVD_SU_SI_STATE_ASGND)
 				continue;
 
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
