Re: [devel] [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2

Gary Lee Sun, 21 Feb 2016 19:56:41 -0800

Hi Nagu

Please ignore the fix.patch that Minh previously sent and use this version instead.




Thanks
Gary

> On 19 Feb 2016, at 8:09 PM, minh chau <minh.c...@dektech.com.au> wrote:
> 
> Hi Nagu,
> 
> Thanks for your testing.
> Below is our investigation of TC #1 to TC #31, which seem to be important, 
> plus some patches that attempt to fix the issues.
> 
> 1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
> Discussion is ongoing. Hitting the limitation causes a mismatch between the 
> objects in amfd/IMM and those in amfnd:
> - amfd tries to tolerate the mismatch or, in the worst case, orders a node 
> reboot of the last payload in the cluster to avoid a cyclic amfd crash
> 
> 2. Suspicious setting SU oper state in amfnd (TC #13, #24)
> According to the trace in TC #24, after unlock-in of the nodegroup, amfnd 
> changes the SU oper state to DISABLED, which is wrong since no fault has 
> happened on the SU.
> We can raise a ticket on the non-headless code base, though we think the patch 
> 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24 for now.
> TC #13 has a similar problem, but the provided trace is only for amfd, so we 
> don't know where amfnd changed the SU oper state to DISABLED. We haven't been 
> able to reproduce this problem in TC #13 so far.
> 
> 3. Problem in TC #16 
> It seems the fault lies in the base code when the system is not headless. The 
> admin state should not stay in the unlocked state. We can raise a ticket on 
> the current non-headless code later.
> 
> 4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
> We have a patch for this; please apply fix.patch
> 
> 5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14, #17, #19, 
> #20, #21)
> We have a patch for this; please apply fix.patch
> 
> 6. Support Nodegroup handling in delayed failover (TC #23)
> When we developed AMF cloud resilience, nodegroup support had not been 
> pushed yet, so we missed it.
> Please apply the patch 1620_amfd_adjust_interm_admin_state.diff
> 
> 7. Problem in TC #25
> We think it's not really a fault. Please see our opinion at the end of this 
> email.
> 
> 8. Delayed failover needs to check csi level (TC #27)
> The fault is reproducible; however, it seems a rare use case in which the 
> user creates an extra CSI just before deciding to go headless. A fix is in 
> progress.
> 
> 9. Recovering a non-existent CSI (TC #28)
> The CSI had been deleted in IMM, but there is a delay at the application, so 
> its assignment object is still in amfnd at the time of recovery. The patch 
> simply skips re-creating this non-existent CSI; please apply 
> 1620_amfd_ignore_nonexisted_csi.diff
> 
> 10. Delayed si dep issue (TC #29, #30)
> We have a patch for this; please apply 
> 1620_amfd_add_su_op_list_delayed_sidep.diff
> 
> 11. About TC #31: does the test case have a fault?
> The trace shows that A sponsors C, but the test locks B and expects C to be 
> removed:
> Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR 
> safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)
> 
> Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13 and #14 
> should be fixed by the attached patches)
> 
> TC #32 to TC #40 test the Npm and Nway models, which we have not planned to 
> support because we haven't seen headless users using those models, so they 
> are expected to be buggy. We added this limitation to the README; supporting 
> them could be a future enhancement.
> 
> So we're still working on some remaining TCs (#2, #3, #4, #15, #27, #41, 
> #42), alarm/notification related issues, and testing the patches.
> If you find any other problems (or if you are not too busy to help reproduce 
> TC #2, #12, #13, #14), the traces are very helpful, though it would be even 
> better if we could have your test model (so we can quickly see which 
> attributes are on or off).
> 
> Thanks,
> Minh
> 
> 
> Opinion on TC#25
> ------------------------
> TC25:
> - At the time SC1 restarts, amfd adjusts the assignment. amfd decides to 
> remove the QUIESCED assignment of safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, 
> because gdb is still holding the quiesced csi_set_callback
> 
> Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >> 
> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> 
> - amfnd on PL3 receives the su_si_del msg and buffers it
> Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >> 
> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >> avnd_su_siq_rec_buf: 
> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >> avnd_su_siq_rec_add: 
> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> 
> - gdb releases csi_set_callback; the quiesced assignment sequence continues, 
> finishes, and is reported to amfd
> Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >> 
> avnd_di_susi_resp_send: Sending Resp 
> su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, si=safSi=AmfDemo,safApp=AmfDemo1, 
> curr_state=3, prv_state=1
> 
> - Then amfnd pulls out the buffered su_si_del and continues with the 
> assignment removal sequence. This sequence finishes and amfnd reports to amfd
> Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending. msg_id'3', 
> node_id'131855', msg_act'4', su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', 
> si'', ha_state'3', error'1', single_csi'0'
> 
> - At amfd, upon receiving the report of quiesced assignment completion, amfd 
> decides to remove the quiesced assignment of SU1
> Feb 11 19:46:27.086796 osafamfd [28309:sgproc.cc:2328] >> 
> avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> 
> - So amfnd on PL3 receives an extra su_si_del msg and logs an error
> Feb 11 19:46:27.330333 osafamfnd [16881:su.cc:0376] >> 
> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 11 19:46:27.330359 osafamfnd [16881:su.cc:0425] ER susi_assign_evh: 
> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' has no assignments
> Feb 11 19:46:27.330369 osafamfnd [16881:su.cc:0447] << 
> avnd_evt_avd_info_su_si_assign_evh: 1
> 
> This is not a real problem; we think we can change the log to a warning.
> The improvement could be made on the amfnd side in the non-headless code 
> base, so that it can flexibly handle an unexpected extra su_si event. #1386 
> found the same kind of limitation in amfnd, where it could not handle an 
> su_si event it did not expect because multiple failures happened along with 
> an admin command.
> 
> 
> 
> 
> 
> 
> 
> <1620_amfd_add_su_op_list_delayed_sidep.diff><1620_amfd_adjust_interm_admin_state.diff><1620_amfd_ignore_nonexisted_csi.diff><1620_amfnd_dont_disabled_healthy_su.diff><fix.patch>
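
The TC #25 sequence quoted above can be illustrated with a small model. This is only a sketch under stated assumptions, not OpenSAF code: the class NodeDirectorModel and all of its methods are hypothetical names, and the model reduces amfnd to a single SU with one assignment. It shows why the second su_si_del finds nothing to remove, and how logging that case as a warning (as suggested above) would look.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class NodeDirectorModel:
    """Toy model of one SU on a payload node (hypothetical, not amfnd)."""
    has_assignment: bool = True
    busy: bool = False  # True while a csi_set_callback is outstanding
    pending: deque = field(default_factory=deque)
    log: list = field(default_factory=list)

    def start_quiesced(self):
        # The quiesced callback is issued and held (gdb in the test case).
        self.busy = True
        self.log.append("quiesced started")

    def receive_su_si_del(self):
        # A removal that arrives while a callback is outstanding is buffered.
        if self.busy:
            self.pending.append("su_si_del")
            self.log.append("su_si_del buffered")
            return
        self._remove()

    def complete_quiesced(self):
        # The callback returns; buffered messages are then processed in order.
        self.busy = False
        self.log.append("quiesced done")
        while self.pending:
            self.pending.popleft()
            self._remove()

    def _remove(self):
        if not self.has_assignment:
            # Assumed behaviour: tolerate the extra removal with a warning.
            self.log.append("WARNING: su_si_del ignored, SU has no assignments")
            return
        self.has_assignment = False
        self.log.append("assignment removed")


nd = NodeDirectorModel()
nd.start_quiesced()
nd.receive_su_si_del()   # first removal, sent when SC1 restarts
nd.complete_quiesced()   # callback released, buffered removal runs
nd.receive_su_si_del()   # second removal, sent on the quiesced response
print("\n".join(nd.log))
```

In this model the final log entry is the warning, which corresponds to the "has no assignments" message in the trace; the earlier entries follow the same order as the quoted amfnd trace lines.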

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
