Hi Nagu,

Please ignore the fix.patch Minh previously sent and replace it with this version.

Thanks
Gary

> On 19 Feb 2016, at 8:09 PM, minh chau <minh.c...@dektech.com.au> wrote:
>
> Hi Nagu,
>
> Thanks for your testing.
> Below is our investigation of TC1 - TC31, which seem to be the important
> ones, plus some patches with which we're trying to fix the issues.
>
> 1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
> Discussion is ongoing. When we hit the limitation, which causes a mismatch
> of objects in amfd/imm vs amfnd:
> - amfd tries to tolerate the mismatch or, in the worst case, orders a node
> reboot of the last payload in the cluster to avoid an amfd cyclic crash
>
> 2. Suspicious setting of SU oper state in amfnd (TC #13, #24)
> According to the trace in TC24, after unlock-in of the nodegroup, amfnd
> changes the SU oper state to DISABLED, which is wrong since no fault
> happened on the SU.
> We can raise a ticket on the non-headless code base, though we think the
> patch 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24 for now.
> TC#13 has a similar problem, but the provided trace is only for amfd, so we
> don't know where amfnd changed the SU oper state to DISABLED. We haven't
> been able to reproduce the problem in TC#13 so far.
>
> 3. Problem in TC #16
> It seems the fault lies in the base code when the system is not headless.
> The admin state should not stay in unlocked state. We can raise a ticket on
> the current non-headless code later.
>
> 4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
> We have a patch for this; please apply fix.patch
>
> 5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14, #17, #19,
> #20, #21)
> We have a patch for this; please apply fix.patch
>
> 6. Support nodegroup handling in delayed failover (TC #23)
> At the time we developed AMF cloud resilience, nodegroup had not been pushed
> yet, so we missed it.
> Please apply the patch 1620_amfd_adjust_interm_admin_state.diff
>
> 7. Problem in TC #25
> We think it's not really a fault. Please see our opinion at the end of this
> email.
>
> 8. Delayed failover needs to check csi level (TC #27)
> The fault is reproducible; however, it seems a rare use case in which the
> user creates an extra csi just before deciding to go headless. A fix is
> ongoing.
>
> 9. Recover non-existent csi (TC #28)
> The CSI had been deleted in IMM, but there's a delay at the application, so
> its assignment object is still in amfnd at the time of recovery. The patch
> simply skips re-creating this non-existent csi; please apply
> 1620_amfd_ignore_nonexisted_csi.diff
>
> 10. Delayed si dep issue (TC #29, #30)
> We have a patch for this; please apply
> 1620_amfd_add_su_op_list_delayed_sidep.diff
>
> 11. About TC #31, does the test case have a fault?
> The trace shows A sponsors C; the test locks B and expects C to be removed.
> Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)
>
> Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13, #14 should
> be fixed by the attached patches)
>
> The tests reported in TC #32 to TC #40 are on Npm and Nway, which we haven't
> planned to support since we haven't seen headless users using those models,
> so they are expected to be buggy. We added this limitation to the README;
> supporting them would be a future enhancement.
>
> So we're still working on some remaining TCs (#2, #3, #4, #15, #27, #41,
> #42), alarm/notification related issues, and testing the patches.
> If you find any other problems (or if you are not too busy to help reproduce
> TC#2, #12, #13, #14), the traces are very helpful, though it would be nicer
> if we could have your testing model (so we can quickly see which attr is
> on/off).
>
> Thanks,
> Minh
>
>
> Opinion on TC#25
> ------------------------
> TC25:
> - At the time SC1 restarts, amfd adjusts the assignment. amfd decides to
> remove the QUIESCED assignment of safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1,
> because gdb is still holding the quiesced csi_set_callback
>
> Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
>
> - amfnd-PL3 receives the su_si_del msg and buffers it
> Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >> avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >> avnd_su_siq_rec_add: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
>
> - gdb releases csi_set_callback; the quiesced assignment sequence
> continues, finishes, and is reported to amfd
> Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >> avnd_di_susi_resp_send: Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1
>
> Then amfnd pulls out the buffered su_si_del and continues the removal
> assignment sequence. This sequence finishes and amfnd reports to amfd
> Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending. msg_id'3', node_id'131855', msg_act'4', su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', si'', ha_state'3', error'1', single_csi'0'
>
> - At amfd, upon receiving the report of quiesced assignment completion,
> amfd decides to remove the quiesced assignment of SU1
> Feb 11 19:46:27.086796 osafamfd [28309:sgproc.cc:2328] >> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
>
> - So amfnd on PL3 receives another, extra su_si_del msg and logs an error
> Feb 11 19:46:27.330333 osafamfnd [16881:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 11 19:46:27.330359 osafamfnd [16881:su.cc:0425] ER susi_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
> Feb 11 19:46:27.330369 osafamfnd [16881:su.cc:0447] << avnd_evt_avd_info_su_si_assign_evh: 1
>
> This is not a real problem; we think we can change it to a warning.
> The improvement could be on the amfnd side for the non-headless code base,
> so that it can flexibly handle an unexpected extra su_si event. #1386 found
> this kind of limitation in amfnd, where it could not handle another su si
> event as expected due to multiple failures happening along with an admin
> command.
>
> <1620_amfd_add_su_op_list_delayed_sidep.diff><1620_amfd_adjust_interm_admin_state.diff><1620_amfd_ignore_nonexisted_csi.diff><1620_amfnd_dont_disabled_healthy_su.diff><fix.patch>

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
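
The TC#25 sequence in the quoted mail (removal order buffered while the quiesced callback is pending, buffered removal executed once the callback returns, then a second removal order arriving for an SU that no longer has assignments) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not OpenSAF code, and the class `ToyNodeDirector` and its methods `on_su_si_del` and `on_callback_done` are hypothetical names invented for illustration. It shows the behaviour suggested in the mail, where the extra removal is tolerated and logged as a warning rather than an error.

```python
import logging
from collections import deque


class ToyNodeDirector:
    """Toy stand-in for a node director handling one SU (hypothetical, not amfnd)."""

    def __init__(self):
        self.assigned = True          # the SU currently holds an assignment
        self.callback_pending = True  # quiesced callback has not returned yet
        self.buffered = deque()       # orders queued while the SU is busy
        self.log = []                 # (level, text) records kept for inspection

    def on_su_si_del(self):
        """Handle a removal order sent by the director."""
        if self.callback_pending:
            # The SU is busy with the quiesced assignment: buffer the order.
            self.buffered.append("su_si_del")
            return "buffered"
        if not self.assigned:
            # Extra removal for an SU without assignments: tolerate it and
            # record a warning instead of an error.
            self.log.append((logging.WARNING, "su has no assignments"))
            return "ignored"
        self.assigned = False
        return "removed"

    def on_callback_done(self):
        """The quiesced callback returns: report it, then drain buffered orders."""
        self.callback_pending = False
        results = ["quiesced_done"]
        while self.buffered:
            self.buffered.popleft()
            results.append(self.on_su_si_del())
        return results


# Replay of the TC#25 ordering.
nd = ToyNodeDirector()
first = nd.on_su_si_del()        # removal order sent while the callback is held
reports = nd.on_callback_done()  # callback released, buffered removal runs
second = nd.on_su_si_del()       # second removal order triggered by the quiesced report
print(first, reports, second, nd.log)
```

In this model the second removal order is harmless because the handler is idempotent for an SU with no assignments, which is the kind of flexibility the mail proposes for the non-headless code base.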