It seems the file wasn't attached properly. Take #2.
Quoting Gary Lee <gary....@dektech.com.au>:
Hi Nagu
Please ignore the fix.patch Minh sent previously and use this version instead.
Thanks
Gary
On 19 Feb 2016, at 8:09 PM, minh chau <minh.c...@dektech.com.au> wrote:
Hi Nagu,
Thanks for your testing.
Below is our investigation of TC #1 to TC #31, which seem to be the
important ones, plus some patches with which we are trying to fix the issues
1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
Discussion is ongoing. When we hit the limitation, the objects in
amfd/IMM no longer match those in amfnd:
- amfd tries to tolerate the mismatch or, in the worst case, orders a
node reboot of the last payload in the cluster to avoid a cyclic
amfd crash
2. Suspicious setting of SU oper state in amfnd (TC #13, #24)
According to the trace in TC #24, after unlock-in on the nodegroup, amfnd
changes the SU oper state to DISABLED, which is wrong since no fault
has occurred on the SU.
We can raise a ticket on the non-headless code base, though we think the
patch 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24 for now.
TC #13 has a similar problem, but the provided trace is only for amfd, so
we don't know where amfnd changed the SU oper state to DISABLED. We
haven't been able to reproduce the problem in TC #13 so far.
3. Problem in TC #16
It seems the fault lies in the base code when the system is not
headless. The admin state should not stay in unlocked state. We can
raise a ticket on the current non-headless code later.
4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
We have a patch for this; please apply fix.patch
5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14,
#17, #19, #20, #21)
We have a patch for this; please apply fix.patch
6. Support Nodegroup handling in delayed failover (TC #23)
When we developed AMF cloud resilience, nodegroup had not been
pushed yet, so we missed it.
Please apply the patch 1620_amfd_adjust_interm_admin_state.diff
7. Problem in TC #25
We think it's not really a fault. Please see our opinion at the end of this email
8. Delayed failover needs to check at CSI level (TC #27)
The fault is reproducible; however, it seems to be a rare use case in
which the user creates an extra CSI just before going headless. A fix
is in progress
9. Recovery of a non-existent CSI (TC #28)
The CSI had been deleted in IMM, but because of a delay in the
application its assignment object is still in amfnd at the time of
recovery. The patch simply skips re-creating this non-existent CSI;
please apply 1620_amfd_ignore_nonexisted_csi.diff
10. Delayed si dep issue (TC #29, #30)
We have a patch for this; please apply
1620_amfd_add_su_op_list_delayed_sidep.diff
11. TC #31: is the test case itself faulty?
The trace shows that A sponsors C, but the test locks B and expects C
to be removed
Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR
safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)
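To illustrate item 9 above (skipping a CSI that no longer exists in IMM during recovery), here is a minimal sketch. It is not taken from 1620_amfd_ignore_nonexisted_csi.diff; the type `ReportedCsiAssignment`, the function `filter_recoverable`, and the idea of holding the configured CSI DNs in a `std::set` are all assumptions made for the example.

```cpp
#include <cassert>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a CSI assignment that amfnd reports
// to amfd after the headless period. This is not an OpenSAF type.
struct ReportedCsiAssignment {
  std::string csi_dn;  // DN of the CSI the assignment refers to
  std::string su_dn;   // DN of the SU holding the assignment
};

// Keep only the assignments whose CSI still exists in the configuration
// (modelled here as a set of DNs). Assignments that refer to a deleted CSI
// are logged and skipped instead of being re-created.
std::vector<ReportedCsiAssignment> filter_recoverable(
    const std::vector<ReportedCsiAssignment>& reported,
    const std::set<std::string>& configured_csis) {
  std::vector<ReportedCsiAssignment> kept;
  for (const auto& a : reported) {
    if (configured_csis.count(a.csi_dn) == 0) {
      std::cerr << "ignoring non-existent CSI '" << a.csi_dn << "'\n";
      continue;
    }
    kept.push_back(a);
  }
  return kept;
}
```

The point of the sketch is only that the stale assignment is dropped quietly on the director side, rather than being treated as a fatal inconsistency.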
Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13, #14
should be fixed by the attached patches)
TC #32 to TC #40 test the Npm and Nway models, which we have not
planned to support since we have not seen headless users using those
models, so they are expected to be buggy. We added this limitation to
the README; supporting them could be a future enhancement.
So we're still working on some remaining TCs (#2, #3, #4, #15, #27,
#41, #42), alarm/notification related issues, and testing the
patches.
If you find any other problems (or if you have time to help
reproduce TC #2, #12, #13, #14), traces are very helpful, though it
would be even better if we could have your test model, so that we can
quickly see which attributes are on or off.
Thanks,
Minh
Opinion on TC#25
------------------------
TC25:
- When SC1 restarts, amfd adjusts the assignments. amfd decides to
remove the QUIESCED assignment of
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, because gdb is still
holding the quiesced csi_set_callback
Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >>
avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
- amfnd on PL3 receives the su_si_del msg and buffers it
Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >>
avnd_evt_avd_info_su_si_assign_evh:
'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >>
avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >>
avnd_su_siq_rec_add: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
- gdb releases csi_set_callback, so the quiesced assignment
sequence continues, finishes, and is reported to amfd
Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >>
avnd_di_susi_resp_send: Sending Resp
su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1,
si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1
Then amfnd pulls out the buffered su_si_del and continues with the
assignment removal sequence. This sequence finishes and amfnd
reports to amfd
Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending.
msg_id'3', node_id'131855', msg_act'4',
su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', si'', ha_state'3',
error'1', single_csi'0'
- At amfd, upon receiving the report of quiesced assignment
completion, amfd decides to remove the
quiesced assignment of SU1
Feb 11 19:46:27.086796 osafamfd [28309:sgproc.cc:2328] >>
avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
- So amfnd on PL3 receives an extra su_si_del msg and logs an error
Feb 11 19:46:27.330333 osafamfnd [16881:su.cc:0376] >>
avnd_evt_avd_info_su_si_assign_evh:
'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:27.330359 osafamfnd [16881:su.cc:0425] ER
susi_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no
assignments
Feb 11 19:46:27.330369 osafamfnd [16881:su.cc:0447] <<
avnd_evt_avd_info_su_si_assign_evh: 1
This is not a real problem; we think we can change the log to a warning.
The improvement could be made on the amfnd side in the non-headless
code base, so that it can handle an unexpected extra su_si event more
flexibly. #1386 found this kind of limitation in amfnd, where it could
not handle another su_si event as expected, due to multiple
failures happening along with an admin command.
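The TC #25 sequence above can be replayed with a toy model. This is only a sketch of the behaviour we describe, not amfnd code: `ToySu`, `Msg`, `Outcome` and the method names are hypothetical, and the real buffering lives in avnd_su_siq_rec_buf / avnd_su_siq_rec_add.

```cpp
#include <cassert>
#include <deque>

// Hypothetical message and result types for the toy model.
enum class Msg { SuSiDel };
enum class Outcome { Buffered, Removed, AlreadyRemoved };

// Toy model of one SU as seen by amfnd. Not the real AVND_SU.
struct ToySu {
  bool has_assignment = true;   // SU currently holds an SI assignment
  bool op_in_progress = false;  // e.g. a quiesced csi_set_callback is pending
  std::deque<Msg> buffered;     // messages waiting for the current operation

  // Handle a removal request sent by amfd.
  Outcome on_su_si_del() {
    if (op_in_progress) {
      buffered.push_back(Msg::SuSiDel);  // cannot act yet, keep for later
      return Outcome::Buffered;
    }
    if (!has_assignment) {
      // Extra removal for an SU that has nothing left: harmless, so a
      // warning-level result is returned rather than an error.
      return Outcome::AlreadyRemoved;
    }
    has_assignment = false;
    return Outcome::Removed;
  }

  // The pending operation finished: replay the buffered messages in order.
  void on_op_done() {
    op_in_progress = false;
    while (!buffered.empty()) {
      buffered.pop_front();
      on_su_si_del();
    }
  }
};
```

With `op_in_progress` set, the first su_si_del is buffered; once `on_op_done()` runs, the buffered removal is applied; the second su_si_del that amfd sends afterwards then finds no assignments, which is the case that is currently logged as ER in su.cc.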
<1620_amfd_add_su_op_list_delayed_sidep.diff><1620_amfd_adjust_interm_admin_state.diff><1620_amfd_ignore_nonexisted_csi.diff><1620_amfnd_dont_disabled_healthy_su.diff><fix.patch>
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
diff --git a/osaf/services/saf/amf/amfd/role.cc b/osaf/services/saf/amf/amfd/role.cc
--- a/osaf/services/saf/amf/amfd/role.cc
+++ b/osaf/services/saf/amf/amfd/role.cc
@@ -180,6 +180,11 @@
goto done;
}
+ if (avd_imm_impl_set() != SA_AIS_OK) {
+ LOG_ER("avd_imm_impl_set FAILED");
+ goto done;
+ }
+
if (avd_imm_config_get() != NCSCC_RC_SUCCESS) {
LOG_ER("avd_imm_config_get FAILED");
goto done;
@@ -187,11 +192,6 @@
cb->init_state = AVD_CFG_DONE;
- if (avd_imm_impl_set() != SA_AIS_OK) {
- LOG_ER("avd_imm_impl_set FAILED");
- goto done;
- }
-
avd_imm_update_runtime_attrs();
status = NCSCC_RC_SUCCESS;
diff --git a/osaf/services/saf/amf/amfd/sgproc.cc b/osaf/services/saf/amf/amfd/sgproc.cc
--- a/osaf/services/saf/amf/amfd/sgproc.cc
+++ b/osaf/services/saf/amf/amfd/sgproc.cc
@@ -282,7 +282,7 @@
{
TRACE_ENTER2("Repair for SU:'%s'", su->name.value);
- if ((su->sg_of_su->saAmfSGAutoRepair) && (su->saAmfSUFailover) &&
+ if ((su->sg_of_su->saAmfSGAutoRepair) &&
(su->saAmfSUOperState == SA_AMF_OPERATIONAL_DISABLED) &&
(su->saAmfSUPresenceState != SA_AMF_PRESENCE_INSTANTIATION_FAILED) &&
(su->saAmfSUPresenceState != SA_AMF_PRESENCE_TERMINATION_FAILED)) {
diff --git a/osaf/services/saf/amf/amfd/siass.cc b/osaf/services/saf/amf/amfd/siass.cc
--- a/osaf/services/saf/amf/amfd/siass.cc
+++ b/osaf/services/saf/amf/amfd/siass.cc
@@ -888,7 +888,9 @@
susi->su->inc_curr_act_si();
susi->si->inc_curr_act_ass();
}
-
+ su->saAmfSUHostedByNode = node->name;
+ avd_saImmOiRtObjectUpdate(&su->name, "saAmfSUHostedByNode",
+ SA_IMM_ATTR_SANAMET, &su->saAmfSUHostedByNode);
m_AVSV_SEND_CKPT_UPDT_ASYNC_ADD(avd_cb, susi, AVSV_CKPT_AVD_SI_ASS);
}
diff --git a/osaf/services/saf/amf/amfnd/err.cc b/osaf/services/saf/amf/amfnd/err.cc
--- a/osaf/services/saf/amf/amfnd/err.cc
+++ b/osaf/services/saf/amf/amfnd/err.cc
@@ -773,28 +773,6 @@
LOG_ER("cleanup of '%s' failed", failed_comp->name.value);
goto done;
}
-
- // if headless, remove all assignments from this SU
- if (cb->is_avd_down == true) {
- AVND_SU_SI_REC *si = 0;
- AVND_SU_SI_REC *next_si = 0;
- uint32_t rc = NCSCC_RC_SUCCESS;
- TRACE("Removing assignments from '%s'", su->name.value);
-
- m_AVND_SU_ASSIGN_PEND_SET(su);
-
- /* scan the su-si list & remove the sis */
- for (si = (AVND_SU_SI_REC *)m_NCS_DBLIST_FIND_FIRST(&su->si_list); si;) {
- next_si = (AVND_SU_SI_REC *)m_NCS_DBLIST_FIND_NEXT(&si->su_dll_node);
- rc = avnd_su_si_remove(cb, su, si);
- if (NCSCC_RC_SUCCESS != rc) {
- LOG_ER("failed to remove SI assignment from '%s'",
- su->name.value);
- break;
- }
- si = next_si;
- }
- }
} else {
/* request director to orchestrate component failover */
rc = avnd_di_oper_send(cb, failed_comp->su, AVSV_ERR_RCVR_SU_FAILOVER);
@@ -1382,10 +1360,13 @@
TRACE_ENTER();
/* first time in this level */
- if (su->sufailover)
+ if (su->sufailover) {
*esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
- else
+ } else if (su->sufailover == false && su->is_ncs == false && cb->is_avd_down == true) {
+ *esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
+ } else {
*esc_rcvr = static_cast<AVSV_ERR_RCVR>(SA_AMF_COMPONENT_FAILOVER);
+ }
/* External components are not supposed to escalate SU Failover of
cluster components. For Ext component, SU Failover will be limited to
@@ -1450,10 +1431,14 @@
TRACE_ENTER();
/* initialize */
- if (su->sufailover)
+ if (su->sufailover) {
*esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
- else
+ } else if (su->sufailover == false && su->is_ncs == false && cb->is_avd_down == true) {
+ LOG_NO("Director is down. Escalate to SU failover for '%s'",su->name.value);
+ *esc_rcvr = AVSV_ERR_RCVR_SU_FAILOVER;
+ } else {
*esc_rcvr = static_cast<AVSV_ERR_RCVR>(SA_AMF_COMPONENT_FAILOVER);
+ }
if (true == su->su_is_external) {
/* External component should not contribute to NODE FAILOVER of cluster
diff --git a/osaf/services/saf/amf/amfnd/susm.cc b/osaf/services/saf/amf/amfnd/susm.cc
--- a/osaf/services/saf/amf/amfnd/susm.cc
+++ b/osaf/services/saf/amf/amfnd/susm.cc
@@ -738,8 +738,13 @@
/* if no si is specified, the action is aimed at all the sis... pick up any si */
curr_si = (si) ? si : (AVND_SU_SI_REC *)m_NCS_DBLIST_FIND_FIRST(&su->si_list);
- if (!curr_si)
+ if (!curr_si) {
+ // after headless, we may have a buffered susi remove msg
+ // if the susi can't be found (already removed), reset flag
+ LOG_NO("no SI found in '%s'", su->name.value);
+ m_AVND_SU_ALL_SI_RESET(su);
goto done;
+ }
/* initiate the si removal for pi su */
if (m_AVND_SU_IS_PREINSTANTIABLE(su)) {
@@ -3474,10 +3479,19 @@
*/
bool sufailover_in_progress(const AVND_SU *su)
{
+ TRACE_ENTER2("%s", su->name.value);
if (m_AVND_SU_IS_FAILED(su) && (su->sufailover) && (!m_AVND_SU_IS_RESTART(su)) &&
- (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs))
+ (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs)) {
+ TRACE_LEAVE();
return true;
- return false;
+ } else if (m_AVND_SU_IS_FAILED(su) && (su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) &&
+ (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs) && avnd_cb->is_avd_down == true) {
+ TRACE_LEAVE();
+ return true;
+ } else {
+ TRACE_LEAVE();
+ return false;
+ }
}
/**
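For readers of the err.cc and susm.cc hunks above, the decision they encode can be summarised in a small standalone sketch. The structs `SuFlags` and `NodeFlags` and both function names are hypothetical simplifications made for this note (the patch itself works on AVND_SU and avnd_cb); only the boolean conditions are taken from the diff.

```cpp
#include <cassert>

enum class Recovery { ComponentFailover, SuFailover };

// Hypothetical subset of the SU state used by the patched conditions.
struct SuFlags {
  bool sufailover;  // saAmfSUFailover is configured for the SU
  bool is_ncs;      // SU belongs to the middleware
  bool failed;      // SU is marked failed
  bool restarting;  // SU restart is in progress
};

// Hypothetical subset of the node-level state.
struct NodeFlags {
  bool avd_down;       // director unreachable, i.e. headless
  bool node_disabled;  // node oper state is DISABLED
};

// Escalation choice as in the two err.cc hunks: while headless, a
// non-middleware SU escalates to SU failover even without saAmfSUFailover.
Recovery choose_escalation(const SuFlags& su, const NodeFlags& node) {
  if (su.sufailover) return Recovery::SuFailover;
  if (!su.is_ncs && node.avd_down) return Recovery::SuFailover;
  return Recovery::ComponentFailover;
}

// The two true-branches of the patched sufailover_in_progress() differ only
// in "sufailover" versus "!sufailover && avd_down", so they fold into one
// expression with (sufailover || avd_down).
bool sufailover_in_progress_sketch(const SuFlags& su, const NodeFlags& node) {
  return su.failed && !su.restarting && !node.node_disabled && !su.is_ncs &&
         (su.sufailover || node.avd_down);
}
```

The folded form in `sufailover_in_progress_sketch` is an observation about the boolean logic only; whether the real function should be written that way is a matter for review.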