It seems the file wasn't attached properly. Take #2.

Quoting Gary Lee <gary....@dektech.com.au>:

Hi Nagu

Please ignore the fix.patch Minh sent previously and use this version instead.



Thanks
Gary

On 19 Feb 2016, at 8:09 PM, minh chau <minh.c...@dektech.com.au> wrote:

Hi Nagu,

Thanks for your testing.
Below is our investigation of TC #1 to TC #31, which seem to be the important ones, plus some patches with which we are trying to fix the issues.

1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
Discussion is ongoing. When we hit the limitation, it causes a mismatch between the objects in amfd/IMM and those in amfnd. amfd tries to tolerate the mismatch or, in the worst case, orders a node reboot of the last payload in the cluster to avoid a cyclic amfd crash.

2. Suspicious setting SU oper state in amfnd (TC #13, #24)
According to the trace in TC #24, after unlock-in of the nodegroup, amfnd changes the SU oper state to DISABLED, which is wrong since no fault happened on the SU. We can raise a ticket on the non-headless code base, though we think the patch 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24 for now.
TC #13 has a similar problem, but the provided trace is only for amfd, so we don't know where amfnd changed the SU oper state to DISABLED. We haven't been able to reproduce the problem in TC #13 so far.

3. Problem in TC #16
It seems the fault lies in the base code when the system is not headless. The admin state should not stay in unlocked state. We can raise a ticket on the current non-headless code later.

4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
We have a patch for this; please apply fix.patch.

5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14, #17, #19, #20, #21)
We have a patch for this; please apply fix.patch.

6. Support Nodegroup handling in delayed failover (TC #23)
At the time we developed AMF cloud resilience, nodegroup support had not been pushed yet, so we missed it.
Please apply the patch 1620_amfd_adjust_interm_admin_state.diff

7. Problem in TC #25
We think it's not really a fault. Please see our opinion at the end of this email.

8. Delayed failover needs to check csi level (TC #27)
The fault is reproducible. However, it seems to be a rare use case in which the user creates an extra CSI just before the cluster goes headless. A fix is in progress.

9. Recovery of a non-existent CSI (TC #28)
The CSI had been deleted in IMM, but because of a delay at the application its assignment object was still in amfnd at the time of recovery. The patch simply skips re-creating this non-existent CSI; please apply 1620_amfd_ignore_nonexisted_csi.diff

10. Delayed si dep issue (TC #29, #30)
We have a patch for this; please apply 1620_amfd_add_su_op_list_delayed_sidep.diff

11. About TC #31: is the test case itself faulty?
The trace shows that A sponsors C, but the test locks B and expects C to be removed.
Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)
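
As an illustration of item 9 above (skipping a CSI that no longer exists in IMM during recovery), here is a minimal sketch. It is not the code in 1620_amfd_ignore_nonexisted_csi.diff; the function name recoverable_csis and the use of plain string DNs are assumptions made only for this example.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not an amfd function.
// reported_by_node: CSI DNs whose assignments amfnd still holds after headless.
// configured:       CSI DNs that currently exist in the configuration (IMM).
// Returns only the CSIs that are safe to re-create on the director side.
std::vector<std::string> recoverable_csis(
    const std::vector<std::string>& reported_by_node,
    const std::set<std::string>& configured) {
  std::vector<std::string> result;
  for (const std::string& dn : reported_by_node) {
    if (configured.count(dn) == 0) {
      // The CSI was deleted while the assignment lingered on the node:
      // skip it instead of re-creating an object with no configuration.
      continue;
    }
    result.push_back(dn);
  }
  return result;
}
```

The point of the sketch is only that recovery filters what the node reports against what is still configured, rather than trusting the node's view.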

Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13 and #14 should be fixed by the attached patches).

TC #32 to TC #40 were run on the Npm and Nway models. We have not planned to support those models, since we haven't seen headless users using them, so they are expected to be buggy. We added this limitation to the README; supporting them could be a future enhancement.

So we're still working on the remaining TCs (#2, #3, #4, #15, #27, #41, #42), on alarm/notification related issues, and on testing the patches. If you find any other problems (or if you have time to help reproduce TC #2, #12, #13, #14), the traces are very helpful, though it would be even better if we could have your test model, so that we can quickly see which attributes are on or off.

Thanks,
Minh


Opinion on TC#25
------------------------
TC25:
- At the time SC1 restarts, amfd adjusts the assignment. amfd decides to remove the QUIESCED assignment of safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, because gdb is still holding the quiesced csi_set_callback

Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- amfnd on PL3 receives the su_si_del msg and buffers it
Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >> avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >> avnd_su_siq_rec_add: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- gdb releases csi_set_callback, the quiesced assignment sequence continues, finishes, and is reported to amfd
Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >> avnd_di_susi_resp_send: Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1

- Then amfnd pulls out the buffered su_si_del and continues with the assignment removal sequence. This sequence finishes and amfnd reports to amfd
Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending. msg_id'3', node_id'131855', msg_act'4', su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', si'', ha_state'3', error'1', single_csi'0'

- At amfd, upon receiving the report that the quiesced assignment has completed, amfd decides to remove the quiesced assignment of SU1
Feb 11 19:46:27.086796 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- So amfnd on PL3 receives an extra su_si_del msg and logs an error
Feb 11 19:46:27.330333 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:27.330359 osafamfnd [16881:su.cc:0425] ER susi_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
Feb 11 19:46:27.330369 osafamfnd [16881:su.cc:0447] << avnd_evt_avd_info_su_si_assign_evh: 1

This is not a real problem; we think we can change the log to a warning.
The improvement could be made on the amfnd side in the non-headless code base, so that it can flexibly handle an unexpected extra su_si event. Ticket #1386 found this kind of limitation in amfnd, where it could not handle another su_si event as expected; it was due to multiple failures happening along with an admin command.
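
To make the TC #25 sequence above concrete, here is a minimal sketch of the behaviour we have in mind: a delete is buffered while a callback is pending, replayed when the callback returns, and a later duplicate delete is tolerated instead of being treated as an error. The types and functions (Su, Outcome, handle_su_si_del, callback_done) are hypothetical and heavily simplified; they are not the real amfnd structures or event handlers.

```cpp
#include <deque>
#include <string>

// Hypothetical, simplified model of an SU on the node director.
enum class Outcome { kBuffered, kRemoved, kIgnoredNoAssignment };

struct Su {
  std::string name;
  bool has_assignment = true;     // SU still holds an SI assignment
  bool callback_pending = false;  // a csi_set_callback has not returned yet
  std::deque<std::string> buffered;  // messages queued behind the callback
};

// Handle one su_si_del message for the SU.
Outcome handle_su_si_del(Su& su) {
  if (su.callback_pending) {
    // Cannot act while the component is still in its callback: queue it.
    su.buffered.push_back("su_si_del");
    return Outcome::kBuffered;
  }
  if (!su.has_assignment) {
    // Extra delete for an already removed assignment: this is where a
    // warning (rather than an error) would be logged, then ignore it.
    return Outcome::kIgnoredNoAssignment;
  }
  su.has_assignment = false;
  return Outcome::kRemoved;
}

// Called when the pending callback returns; replays buffered messages.
void callback_done(Su& su) {
  su.callback_pending = false;
  while (!su.buffered.empty()) {
    su.buffered.pop_front();
    handle_su_si_del(su);
  }
}
```

With this model, the first su_si_del is buffered, callback_done() removes the assignment, and the second su_si_del from amfd returns kIgnoredNoAssignment, which matches the "has no assignments" log seen on PL3.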







<1620_amfd_add_su_op_list_delayed_sidep.diff><1620_amfd_adjust_interm_admin_state.diff><1620_amfd_ignore_nonexisted_csi.diff><1620_amfnd_dont_disabled_healthy_su.diff><fix.patch>

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


diff --git a/osaf/services/saf/amf/amfd/role.cc b/osaf/services/saf/amf/amfd/role.cc
--- a/osaf/services/saf/amf/amfd/role.cc
+++ b/osaf/services/saf/amf/amfd/role.cc
@@ -180,6 +180,11 @@
 		goto done;
 	}
 
+	if (avd_imm_impl_set() != SA_AIS_OK) {
+		LOG_ER("avd_imm_impl_set FAILED");
+		goto done;
+	}
+
 	if (avd_imm_config_get() != NCSCC_RC_SUCCESS) {
 		LOG_ER("avd_imm_config_get FAILED");
 		goto done;
@@ -187,11 +192,6 @@
 
 	cb->init_state = AVD_CFG_DONE;
 
-	if (avd_imm_impl_set() != SA_AIS_OK) {
-		LOG_ER("avd_imm_impl_set FAILED");
-		goto done;
-	}
-
 	avd_imm_update_runtime_attrs();
 
 	status = NCSCC_RC_SUCCESS;
diff --git a/osaf/services/saf/amf/amfd/sgproc.cc b/osaf/services/saf/amf/amfd/sgproc.cc
--- a/osaf/services/saf/amf/amfd/sgproc.cc
+++ b/osaf/services/saf/amf/amfd/sgproc.cc
@@ -282,7 +282,7 @@
 {
 	TRACE_ENTER2("Repair for SU:'%s'", su->name.value);
 
-	if ((su->sg_of_su->saAmfSGAutoRepair) && (su->saAmfSUFailover) &&
+	if ((su->sg_of_su->saAmfSGAutoRepair) &&
 			(su->saAmfSUOperState == SA_AMF_OPERATIONAL_DISABLED) &&
 			(su->saAmfSUPresenceState != SA_AMF_PRESENCE_INSTANTIATION_FAILED) && 
 			(su->saAmfSUPresenceState != SA_AMF_PRESENCE_TERMINATION_FAILED)) {
diff --git a/osaf/services/saf/amf/amfd/siass.cc b/osaf/services/saf/amf/amfd/siass.cc
--- a/osaf/services/saf/amf/amfd/siass.cc
+++ b/osaf/services/saf/amf/amfd/siass.cc
@@ -888,7 +888,9 @@
 			susi->su->inc_curr_act_si();
 			susi->si->inc_curr_act_ass();
 		}
-
+		su->saAmfSUHostedByNode = node->name;
+		avd_saImmOiRtObjectUpdate(&su->name, "saAmfSUHostedByNode",
+		SA_IMM_ATTR_SANAMET, &su->saAmfSUHostedByNode);
 		m_AVSV_SEND_CKPT_UPDT_ASYNC_ADD(avd_cb, susi, AVSV_CKPT_AVD_SI_ASS);
 	}
 
diff --git a/osaf/services/saf/amf/amfnd/err.cc b/osaf/services/saf/amf/amfnd/err.cc
--- a/osaf/services/saf/amf/amfnd/err.cc
+++ b/osaf/services/saf/amf/amfnd/err.cc
@@ -773,28 +773,6 @@
 			LOG_ER("cleanup of '%s' failed", failed_comp->name.value);
 			goto done;
 		}
-
-		// if headless, remove all assignments from this SU
-		if (cb->is_avd_down == true) {
-			AVND_SU_SI_REC *si = 0;
-			AVND_SU_SI_REC *next_si = 0;
-			uint32_t rc = NCSCC_RC_SUCCESS;
-			TRACE("Removing assignments from '%s'", su->name.value);
-
-			m_AVND_SU_ASSIGN_PEND_SET(su);
-
-			/* scan the su-si list & remove the sis */
-			for (si = (AVND_SU_SI_REC *)m_NCS_DBLIST_FIND_FIRST(&su->si_list); si;) {
-				next_si = (AVND_SU_SI_REC *)m_NCS_DBLIST_FIND_NEXT(&si->su_dll_node);
-				rc = avnd_su_si_remove(cb, su, si);
-				if (NCSCC_RC_SUCCESS != rc) {
-					LOG_ER("failed to remove SI assignment from '%s'",
-						su->name.value);
-					break;
-				}
-				si = next_si;
-			}
-		}
 	} else  {
 		/* request director to orchestrate component failover */
 		rc = avnd_di_oper_send(cb, failed_comp->su, AVSV_ERR_RCVR_SU_FAILOVER);
@@ -1382,10 +1360,13 @@
 	TRACE_ENTER();
 
 	/* first time in this level */
-	if (su->sufailover)
+	if (su->sufailover) {
 		*esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
-	else
+	} else if (su->sufailover == false && su->is_ncs == false && cb->is_avd_down == true) {
+		*esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
+	} else {
 		*esc_rcvr = static_cast<AVSV_ERR_RCVR>(SA_AMF_COMPONENT_FAILOVER);
+	}
 
 	/* External components are not supposed to escalate SU Failover of
 	   cluster components. For Ext component, SU Failover will be limited to
@@ -1450,10 +1431,14 @@
 	TRACE_ENTER();
 
 	/* initialize */
-	if (su->sufailover)
+	if (su->sufailover) {
 		*esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
-	else
+	} else if (su->sufailover == false && su->is_ncs == false && cb->is_avd_down == true) {
+		LOG_NO("Director is down. Escalate to SU failover for '%s'",su->name.value);
+		*esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
+	} else {
 		*esc_rcvr = static_cast<AVSV_ERR_RCVR>(SA_AMF_COMPONENT_FAILOVER);
+	}
 
 	if (true == su->su_is_external) {
 		/* External component should not contribute to NODE FAILOVER of cluster
diff --git a/osaf/services/saf/amf/amfnd/susm.cc b/osaf/services/saf/amf/amfnd/susm.cc
--- a/osaf/services/saf/amf/amfnd/susm.cc
+++ b/osaf/services/saf/amf/amfnd/susm.cc
@@ -738,8 +738,13 @@
 
 	/* if no si is specified, the action is aimed at all the sis... pick up any si */
 	curr_si = (si) ? si : (AVND_SU_SI_REC *)m_NCS_DBLIST_FIND_FIRST(&su->si_list);
-	if (!curr_si)
+	if (!curr_si) {
+		// after headless, we may have a buffered susi remove msg
+		// if the susi can't be found (already removed), reset flag
+		LOG_NO("no SI found in '%s'", su->name.value);
+		m_AVND_SU_ALL_SI_RESET(su);
 		goto done;
+	}
 
 	/* initiate the si removal for pi su */
 	if (m_AVND_SU_IS_PREINSTANTIABLE(su)) {
@@ -3474,10 +3479,19 @@
  */
 bool sufailover_in_progress(const AVND_SU *su)
 {
+	TRACE_ENTER2("%s", su->name.value);
 	if (m_AVND_SU_IS_FAILED(su) && (su->sufailover) && (!m_AVND_SU_IS_RESTART(su)) &&
-			 (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs))
+			 (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs)) {
+				TRACE_LEAVE();
 				return true;
-	return false;
+	} else if (m_AVND_SU_IS_FAILED(su) && (su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) &&
+			 (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs) && avnd_cb->is_avd_down == true) {
+				TRACE_LEAVE();
+				return true;
+	} else {
+		TRACE_LEAVE();
+		return false;
+	}
 }
 
 /**