Hi,

Please see the enclosed patch for TC #1, #6, #7, #8, #9, #10 and #11.

Thanks,
HansN

On 02/19/2016 10:09 AM, minh chau wrote:
Hi Nagu,

Thanks for your testing.
Below is our investigation of TC #1 to TC #31, which seem to be the important ones, plus some patches with which we are trying to fix the issues.

1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
Discussion is ongoing. When we hit the limitation, the objects in amfd/IMM no longer match those in amfnd. amfd tries to tolerate the mismatch; in the worst case, amfd orders a node reboot of the last payload in the cluster to avoid a cyclic amfd crash.
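
The tolerate-or-reboot decision described above can be sketched roughly as follows. This is only an illustration under assumptions, not the amfd code: `SyncAction`, `find_unknown_sus` and `decide_sync_action` are hypothetical names, and SUs are modelled as plain strings.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical outcome of comparing what amfnd reports with what amfd knows.
enum class SyncAction { kProceed, kOrderReboot };

// Returns the SU names reported by a node director that the director side
// (amfd, populated from IMM) has no object for. Illustrative only.
std::vector<std::string> find_unknown_sus(
    const std::set<std::string>& known_to_amfd,
    const std::vector<std::string>& reported_by_amfnd) {
  std::vector<std::string> unknown;
  for (const auto& su : reported_by_amfnd) {
    if (known_to_amfd.count(su) == 0) unknown.push_back(su);
  }
  return unknown;
}

// Instead of asserting on the first mismatch (which would crash amfd again
// after every restart), report that a reboot should be ordered.
SyncAction decide_sync_action(
    const std::set<std::string>& known_to_amfd,
    const std::vector<std::string>& reported_by_amfnd) {
  return find_unknown_sus(known_to_amfd, reported_by_amfnd).empty()
             ? SyncAction::kProceed
             : SyncAction::kOrderReboot;
}
```

A caller would act on `kOrderReboot` by sending the reboot request rather than continuing recovery with inconsistent data.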

2. Suspicious setting of SU oper state in amfnd (TC #13, #24)
According to the trace in TC #24, after unlock-in of the nodegroup, amfnd changes the SU oper state to DISABLED, which is wrong since no fault happened on the SU. We can raise a ticket on the non-headless code base, though we think the patch 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24 for now. TC #13 has a similar problem, but the provided trace is only for amfd, so we don't know where amfnd changed the SU oper state to DISABLED. We haven't been able to reproduce this problem in TC #13 so far.

3. Problem in TC #16
It seems the fault lies in the base code when the system is not headless. The admin state should not stay in unlocked state. We can raise a ticket on the current non-headless code later.

4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
Have patch for this, please apply fix.patch

5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14, #17, #19, #20, #21)
Have patch for this, please apply fix.patch

6. Support Nodegroup handling in delayed failover (TC #23)
At the time we developed AMF cloud resilience, nodegroup support had not been pushed yet, so we missed it.
Please apply the patch 1620_amfd_adjust_interm_admin_state.diff

7. Problem in TC #25
We think it's not really a fault. Please see our opinion at the end of email

8. Delayed failover needs to check csi level (TC #27)
The fault is reproducible; however, it seems to be a rare use case in which the user creates an extra csi just before the cluster goes headless. A fix is in progress.

9. Recovery of a non-existent csi (TC #28)
The CSI had been deleted from IMM, but because of a delay in the application its assignment object was still present in amfnd at the time of recovery. The patch simply skips re-creating this non-existent csi; please apply 1620_amfd_ignore_nonexisted_csi.diff
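
As a rough illustration of the idea (not the content of the patch itself), skipping assignments whose CSI is gone from IMM could look like the sketch below. `ReportedCsiAssignment` and `filter_recoverable` are hypothetical names, and DNs are plain strings here.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one CSI assignment reported by amfnd after headless.
struct ReportedCsiAssignment {
  std::string csi_dn;
  std::string comp_dn;
};

// Keeps only the assignments whose CSI still exists in IMM. Assignments for
// a CSI deleted while the cluster was headless are skipped, not re-created.
std::vector<ReportedCsiAssignment> filter_recoverable(
    const std::set<std::string>& csis_in_imm,
    const std::vector<ReportedCsiAssignment>& reported) {
  std::vector<ReportedCsiAssignment> keep;
  for (const auto& assignment : reported) {
    if (csis_in_imm.count(assignment.csi_dn) != 0) {
      keep.push_back(assignment);
    }
    // else: stale assignment for a deleted CSI, ignore it
  }
  return keep;
}
```

The recovery path would then re-create assignment objects only for the entries returned by `filter_recoverable`.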

10. Delayed si dep issue (TC #29, #30)
Have patch for this, please apply 1620_amfd_add_su_op_list_delayed_sidep.diff

11. TC #31: is the test case itself faulty?
The trace shows that A sponsors C, but the test locks B and expects C to be removed.
Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)

Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13 and #14 should be fixed by the attached patches)

TC #32 to TC #40 were run on the N+M (Npm) and N-way (Nway) models. We have not planned to support those models, since we haven't seen headless users using them, so they are expected to be buggy. We have added this limitation to the README; supporting them could be a future enhancement.

So we're still working on the remaining TCs (#2, #3, #4, #15, #27, #41, #42), on alarm/notification related issues, and on testing the patches. If you find any other problems (or if you have time to help reproduce TC #2, #12, #13, #14), the traces are very helpful. It would be even better if we could have your test model, so that we can quickly see which attributes are on or off.

Thanks,
Minh


Opinion on TC#25
------------------------
TC25:
- At the time SC1 restarts, amfd adjusts the assignment. amfd decides to remove the QUIESCED assignment of safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, because gdb is still holding the quiesced csi_set_callback

Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- amfnd on PL3 receives the su_si_del msg and buffers it
Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >> avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >> avnd_su_siq_rec_add: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- gdb releases csi_set_callback, the quiesced assignment sequence continues, finishes, and is reported to amfd
Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >> avnd_di_susi_resp_send: Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1

- Then amfnd pulls out the buffered su_si_del and continues with the assignment removal sequence. This sequence finishes and amfnd reports to amfd
Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending. msg_id'3', node_id'131855', msg_act'4', su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', si'', ha_state'3', error'1', single_csi'0'

- At amfd, upon receiving the report of quiesced assignment completion, amfd decides to remove the quiesced assignment of SU1
Feb 11 19:46:27.086796 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'

- So amfnd on PL3 receives an extra su_si_del msg and logs an error
Feb 11 19:46:27.330333 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:27.330359 osafamfnd [16881:su.cc:0425] ER susi_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
Feb 11 19:46:27.330369 osafamfnd [16881:su.cc:0447] << avnd_evt_avd_info_su_si_assign_evh: 1

This is not a real problem; we think the log can be changed to a warning.
The improvement could be made on the amfnd side for the non-headless code base, so that it can handle an unexpected extra su_si event more flexibly. Ticket #1386 found the same kind of limitation in amfnd, where it could not handle an su_si event it did not expect; that case was caused by multiple failures happening along with an admin command.
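
A minimal sketch of what "handle the extra su_si_del flexibly" could mean, under the assumption that a removal for an SU with no assignments is simply treated as a duplicate. `AssignmentTable`, `HandleResult` and the method names are hypothetical and do not exist in amfnd.

```cpp
#include <map>
#include <string>

// Hypothetical result of handling a "remove all assignments" request.
enum class HandleResult { kRemoved, kIgnoredDuplicate };

// Toy model of per-SU assignment bookkeeping on the node director side.
class AssignmentTable {
 public:
  // Records one more SI assignment for the given SU.
  void assign(const std::string& su) { ++count_[su]; }

  // Handles a removal request. If the SU has no assignments left (for
  // example because an earlier, buffered removal already ran), the request
  // is counted as a tolerated duplicate instead of being treated as an error.
  HandleResult on_remove_all(const std::string& su) {
    auto it = count_.find(su);
    if (it == count_.end()) {
      ++ignored_;  // a real implementation would log a warning here
      return HandleResult::kIgnoredDuplicate;
    }
    count_.erase(it);
    return HandleResult::kRemoved;
  }

  // Number of duplicate removal requests tolerated so far.
  int ignored() const { return ignored_; }

 private:
  std::map<std::string, int> count_;
  int ignored_ = 0;
};
```

In the TC #25 sequence, the first su_si_del would return `kRemoved` and the second one `kIgnoredDuplicate`.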








# HG changeset patch
# User Hans Nordeback <hans.nordeb...@ericsson.com>
# Date 1456144854 -3600
#      Mon Feb 22 13:40:54 2016 +0100
# Node ID 76cc1a6bb7f2670a14ed24590a1b6b537626355d
# Parent  743bffd74700c2174a4af2f3ac0bbfa1de246f69
amfd: Reboot cluster at data inconsistency [#1620]

diff --git a/osaf/services/saf/amf/amfd/include/util.h b/osaf/services/saf/amf/amfd/include/util.h
--- a/osaf/services/saf/amf/amfd/include/util.h
+++ b/osaf/services/saf/amf/amfd/include/util.h
@@ -74,6 +74,7 @@ uint32_t avd_snd_node_up_msg(struct cl_c
 uint32_t avd_snd_presence_msg(struct cl_cb_tag *cb, AVD_SU *su, bool term_state);
 uint32_t avd_snd_oper_state_msg(struct cl_cb_tag *cb, AVD_AVND *avnd, uint32_t msg_id_ack);
 uint32_t avd_snd_op_req_msg(struct cl_cb_tag *cb, AVD_AVND *avnd, AVSV_PARAM_INFO *param_info);
+void avd_d2n_snd_reboot_req_msg();
 uint32_t avd_snd_su_reg_msg(struct cl_cb_tag *cb, AVD_AVND *avnd, bool fail_over);
 uint32_t avd_snd_su_msg(struct cl_cb_tag *cb, AVD_SU *su);
 uint32_t avd_snd_susi_msg(struct cl_cb_tag *cb, AVD_SU *su, struct avd_su_si_rel_tag *susi,
diff --git a/osaf/services/saf/amf/amfd/siass.cc b/osaf/services/saf/amf/amfd/siass.cc
--- a/osaf/services/saf/amf/amfd/siass.cc
+++ b/osaf/services/saf/amf/amfd/siass.cc
@@ -839,7 +839,14 @@ SaAisErrorT avd_susi_recreate(AVSV_N2D_N
 		su_state = su_state->next) {
 
 		AVD_SU *su = su_db->find(Amf::to_string(&su_state->safSU));
-		osafassert(su);
+                if (su == nullptr) {
+                    LOG_ER("SU data inconsistency detected. Ordering cluster reboot");
+		    avd_d2n_snd_reboot_req_msg();
+		    for (;;) {
+			LOG_ER("Waiting for reboot");
+			sleep(1);
+		    }
+                }
 
 		// present state
 		su->set_pres_state(static_cast<SaAmfPresenceStateT>(su_state->su_pres_state));
diff --git a/osaf/services/saf/amf/amfd/util.cc b/osaf/services/saf/amf/amfd/util.cc
--- a/osaf/services/saf/amf/amfd/util.cc
+++ b/osaf/services/saf/amf/amfd/util.cc
@@ -1916,3 +1916,25 @@ bool admin_op_is_valid(SaImmAdminOperati
  	child_dn->length = i;
  	return 0;
  }
+
+ /**
+ * Broadcasts a reboot request to all amf node directors. 
+ * Use broadcast as adests may not be available at the time of reboot
+ * request.
+ */
+void avd_d2n_snd_reboot_req_msg() {
+  TRACE_ENTER();
+  
+  AVD_DND_MSG *reboot_req_msg = new AVSV_DND_MSG();
+
+  /* prepare the reboot request message. */
+  reboot_req_msg->msg_type = AVSV_D2N_REBOOT_MSG;
+  reboot_req_msg->msg_info.d2n_reboot_info.msg_id = 0;
+  
+  /* Broadcast the operation request message to all the nodes. */
+  avd_d2n_msg_bcast(avd_cb, reboot_req_msg);
+
+  delete reboot_req_msg;
+
+  TRACE_LEAVE();
+}
\ No newline at end of file
diff --git a/osaf/services/saf/amf/amfnd/amfnd.cc b/osaf/services/saf/amf/amfnd/amfnd.cc
--- a/osaf/services/saf/amf/amfnd/amfnd.cc
+++ b/osaf/services/saf/amf/amfnd/amfnd.cc
@@ -386,7 +386,8 @@ uint32_t avnd_evt_avd_reboot_evh(AVND_CB
 
 	osafassert(AVSV_D2N_REBOOT_MSG == evt->info.avd->msg_type);
 
-	avnd_msgid_assert(info->msg_id);
+	if (info->msg_id)
+		avnd_msgid_assert(info->msg_id);
 	cb->rcv_msg_id = info->msg_id;
 
 	/* Clear error report related alarms before reboot. 
diff --git a/osaf/services/saf/amf/amfnd/mds.cc b/osaf/services/saf/amf/amfnd/mds.cc
--- a/osaf/services/saf/amf/amfnd/mds.cc
+++ b/osaf/services/saf/amf/amfnd/mds.cc
@@ -337,7 +337,8 @@ uint32_t avnd_mds_rcv(AVND_CB *cb, MDS_C
 		 * message, to the Anchor of the received message.
 		 */
 		if ((AVSV_D2N_NODE_UP_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type) ||
-		    (AVSV_D2N_DATA_VERIFY_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type)) {
+		    (AVSV_D2N_DATA_VERIFY_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type) ||
+		    (AVSV_D2N_REBOOT_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type)) {
 			cb->active_avd_adest = rcv_info->i_fr_dest;
 			TRACE_1("Active AVD Adest = %" PRIu64 ,cb->active_avd_adest);
 		}
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
