Hi Praveen, I have sent out V2: - If we don't care no_retries in avnd_diq_del(), then I think checking mds failure code can be ignored - Add checking for su_oper msg, since amfd also waits for this message to orchestrate recovery
Thanks, Minh On 18/05/17 19:38, praveen malviya wrote: > Hi Minh, > > I had analysed the traces you attached. > Based on that I am able to test that. When MDS returns success patch > works fine. > Minor correction is needed when MDS return failure. > I think susi message should be kept independent of no. of tries in > avnd_diq_del(). > Thanks > Praveen > > > On 18-May-17 12:41 PM, minh chau wrote: >> Hi Praveen, >> >> Some comments in line with [Minh] >> >> thanks, >> Minh >> >> On 18/05/17 14:54, praveen malviya wrote: >>> Hi Minh, >>> >>> In the description of the ticket there is a log which is : >>> " >>> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO Assigned >>> 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to >>> 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' >>> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO >>> avnd_di_susi_resp_send() deferred as AMF director is offline >>> " >>> Last line in above log means AMFND was sending the message when it >>> new about SC absence state. I think this issue is already fixed >>> during #1725 and this published patch is not required. Why? After >>> led set message amfnd will anyway send this message. >> [Minh] I have reproduced the problem and attached to ticket for your >> reference. >> Some outlined logs: >> The step is stopping SC1, SC2. >> In SC2, amfd sent susi assignment req to amfnd-PL3 >> May 18 16:32:03.633226 osafamfd [245:245:src/amf/amfd/sgproc.cc:2444] >> >> avd_sg_su_si_mod_snd: >> 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon', state 1 >> >> In PL3, amfnd completed this susi req, and sent susi resp >> successfully but it did not reach to amfd-SC2 >> May 18 16:32:03.641156 osafamfnd [186:186:src/amf/amfnd/su.cc:0373] >> >> avnd_evt_avd_info_su_si_assign_evh: >> 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon' >> May 18 16:32:03.641744 osafamfnd [186:186:src/amf/amfnd/di.cc:0866] >> >> avnd_di_susi_resp_send: Sending Resp >> su=safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon, >> si=safSi=AmfDemoTwon,safApp=AmfDemoTwon, curr_state=1, prv_state=2 >> >> amfnd-PL3 is notified NCSMDS_DOWN, amfnd deleted all pending msg >> waiting for ack >> May 18 16:32:05.568471 osafamfnd [186:186:src/amf/amfnd/di.cc:0629] >> >> avnd_evt_mds_avd_dn_evh >> May 18 16:32:05.568492 osafamfnd [186:186:src/amf/amfnd/di.cc:0651] >> WA AMF director unexpectedly crashed >> May 18 16:32:05.568495 osafamfnd [186:186:src/amf/amfnd/di.cc:0701] >> TR Delete all pending messages to be sent to AMFD >> May 18 16:32:05.568498 osafamfnd [186:186:src/amf/amfnd/di.cc:1353] >> >> avnd_diq_rec_del >> May 18 16:32:05.568503 osafamfnd [186:186:src/amf/amfnd/di.cc:1369] >> << avnd_diq_rec_del >> >> When SC restarts, amfd-SC1 thinks this assignment being in progress, >> so it waits and waits forever >> May 18 16:32:28.954967 osafamfd [257:257:src/amf/amfd/su.cc:2588] >> >> any_susi_fsm_in: SU:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon', >> check_fsm:5 >> May 18 16:32:28.954975 osafamfd [257:257:src/amf/amfd/su.cc:2593] TR >> SUSI:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon,safSi=AmfDemoTwon,safApp=AmfDemoTwon', >> >> fsm:'5' >> May 18 16:32:28.954982 osafamfd [257:257:src/amf/amfd/su.cc:2596] TR >> Found >> May 18 16:32:28.954989 osafamfd [257:257:src/amf/amfd/su.cc:2599] << >> any_susi_fsm_in >> May 18 16:32:28.954996 osafamfd [257:257:src/amf/amfd/sg.cc:2340] << >> any_assignment_in_progress >> >> This problem is very close to the one you mentioned and fixed in >> #1725. In #1725, amfnd surely knows amfd down, so amfnd buffers msg. >> In #2105, amfnd sends msg out just before amfnd detects amfd being down. >>> >>> The logs that I have attached can be ignored. I was simulating the >>> bug on different assumptions. >>> >>> One question regarding the patch: >>> If the goal is to fix the issue when the message is being sent and >>> system has become SC-less. In this situation, then avnd_mds_send() >>> will return, most probably, a failure as MDS will not find the >>> destination. In mds failure case, rec->no_retries will not be >>> incremented and will remain zero. Now AMFND will process down of SC >>> and it will call avnd_diq_del(). In this function, since no_retries >>> is zero for this message(first message), the message will be deleted. >>> >> [Minh]: Thanks, it's good to handle failure code returned from MDS. I >> will update the patch >>> >>> Thanks, >>> Praveen >>> >>> >>> On 18-May-17 9:14 AM, minh chau wrote: >>>> Hi Praveen, >>>> >>>> Do you have any idea why @is_avd_down was false that made amfnd to >>>> send susi_resp at 12:37:20.453974? >>>> It should be true by the end of avnd_evt_mds_avd_dn_evh() at >>>> 12:37:16.741518, is it right? >>>> >>>> Thanks, >>>> Minh >>>> On 17/05/17 21:31, minh chau wrote: >>>>> Hi Praveen, >>>>> >>>>> Thanks for looking at the issue. >>>>> Here is what I am observing >>>>> >>>>> amfnd-PL3 received NCSMDS_DOWN indicating no active amfd >>>>> >>>>> May 17 12:37:16.741308 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0629] >> avnd_evt_mds_avd_dn_evh >>>>> May 17 12:37:16.741342 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0651] WA AMF director unexpectedly >>>>> crashed >>>>> May 17 12:37:16.741354 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0701] TR Delete all pending >>>>> messages to be sent to AMFD >>>>> May 17 12:37:16.741379 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking >>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages >>>>> May 17 12:37:16.741405 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking >>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' for pending messages >>>>> May 17 12:37:16.741430 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking >>>>> 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' for pending messages >>>>> May 17 12:37:16.741505 osafamfnd >>>>> [8141:8141:src/amf/amfnd/tmr.cc:0083] TR SC absence timer started >>>>> May 17 12:37:16.741518 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0742] << avnd_evt_mds_avd_dn_evh >>>>> >>>>> But a bit later, susi got assigned, and amfnd-PL3 did send this >>>>> susi response (it should not send out and buffer it, since the >>>>> @is_avd_down should be true) >>>>> >>>>> May 17 12:37:20.453974 osafamfnd >>>>> [8141:8141:src/amf/amfnd/di.cc:0866] >> avnd_di_susi_resp_send: >>>>> Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, >>>>> si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1 >>>>> ... >>>>> May 17 12:37:20.454083 osafamfnd >>>>> [8141:8141:src/amf/amfnd/mds.cc:1482] >> avnd_mds_send: Msg type '1' >>>>> May 17 12:37:20.454244 osafamfnd >>>>> [8141:8141:src/amf/amfnd/mds.cc:1537] ER ncsmds_api for 0 FAILED, >>>>> dest=0 >>>>> >>>>> When SC1 restarted, amfd received the very first messages from PL3 >>>>> starting with msg_id=1 (it should be starting from 0) >>>>> >>>>> May 17 12:37:28.398633 osafamfd >>>>> [7686:7686:src/amf/amfd/ndproc.cc:0330] NO Receive message with >>>>> event type:12, msg_type:31, from node:2030f, msg_id:1 >>>>> May 17 12:37:28.413018 osafamfd >>>>> [7686:7686:src/amf/amfd/ndfsm.cc:0334] NO Received node_up_msg >>>>> from all nodes >>>>> May 17 12:37:28.413069 osafamfd >>>>> [7686:7686:src/amf/amfd/ndfsm.cc:0254] NO Received node_up from >>>>> 2030f: msg_id 2 >>>>> >>>>> Looks to me something should not happen inside >>>>> avnd_evt_mds_avd_dn_evh(). In this avnd_evt_mds_avd_dn_evh(), >>>>> @is_avd_down should be true, the msg counter should be reset to 0, >>>>> but I do see the SC absence timer started. I couldn't figure how >>>>> it happened for now >>>>> >>>>> Thanks, >>>>> Minh >>>>> >>>>> On 17/05/17 20:03, praveen malviya wrote: >>>>>> What I see is avnd_diq_del() is called as soon as system becomes >>>>>> headless. This will delete all pending messages. But when >>>>>> component will respond during SCs absence a new message will be >>>>>> generated and buffered. >>>>>> For node_up AMFD will ack the message, but amfnd calls >>>>>> avnd_diq_rec_del() (not avnd_diq_del()) in >>>>>> avnd_di_msg_ack_process(). >>>>>> We need to call avnd_diq_del() in ack message so that msg_id gets >>>>>> updated. >>>>>> Further looking into it.. >>>>>> >>>>>> >>>>>> Thanks. >>>>>> Praveen >>>>>> >>>>>> >>>>>> >>>>>> On 17-May-17 1:50 PM, praveen malviya wrote: >>>>>>> Hi Minh, >>>>>>> >>>>>>> While testing this, I am observing that amfd is dropping the >>>>>>> assignment >>>>>>> message because of message id mismatch: >>>>>>> May 17 12:37:39.522117 osafamfd >>>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1171] >>>>>>> >> avd_su_si_assign_evh: id:1, node:2030f, act:5, >>>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', '', ha:3, err:1, >>>>>>> single:0 >>>>>>> .... >>>>>>> .... >>>>>>> May 17 12:37:39.522404 osafamfd >>>>>>> [7686:7686:src/amf/amfd/ndproc.cc:0075] >>>>>>> WA avd_msg_sanity_chk: invalid msg id 1, msg type 5, from 2030f >>>>>>> should be 3 >>>>>>> May 17 12:37:39.522418 osafamfd >>>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1777] >>>>>>> << avd_su_si_assign_evh >>>>>>> >>>>>>> I am also looking into this. For your reference I had attached >>>>>>> amfd and >>>>>>> amfnd traces from SC-1 and PL-3 respectively in the ticket. >>>>>>> I am working with one controller and one payload. >>>>>>> >>>>>>> >>>>>>> Thanks >>>>>>> Praveen >>>>>>> >>>>>>> On 15-May-17 1:06 PM, Minh Chau wrote: >>>>>>>> When amfnd-payload responds susi assignment response just >>>>>>>> before both SC >>>>>>>> go down, and that response message does not come to director. >>>>>>>> Therefore, >>>>>>>> the status of that assignment could be seen as "modifying" in >>>>>>>> IMM. When >>>>>>>> SC comes back, active amfd will be waiting for that response >>>>>>>> forever. >>>>>>>> >>>>>>>> Patch checks if a susi assignment response is sent but not-ack >>>>>>>> just before >>>>>>>> both SC come down, amfnd-payload will buffer it in a way as a >>>>>>>> susi get >>>>>>>> assigned during SC absence >>>>>>>> --- >>>>>>>> src/amf/amfnd/di.cc | 53 >>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++-------- >>>>>>>> 1 file changed, 45 insertions(+), 8 deletions(-) >>>>>>>> >>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc >>>>>>>> index e06b9260d..3776a09dc 100644 >>>>>>>> --- a/src/amf/amfnd/di.cc >>>>>>>> +++ b/src/amf/amfnd/di.cc >>>>>>>> @@ -1282,16 +1282,53 @@ void avnd_di_msg_ack_process(AVND_CB >>>>>>>> *cb, uint32_t mid) { >>>>>>>> Notes : None. >>>>>>>> ******************************************************************************/ >>>>>>>> >>>>>>>> >>>>>>>> void avnd_diq_del(AVND_CB *cb) { >>>>>>>> - AVND_DND_MSG_LIST *rec = 0; >>>>>>>> - do { >>>>>>>> - /* pop the record */ >>>>>>>> - m_AVND_DIQ_REC_POP(cb, rec); >>>>>>>> - if (!rec) break; >>>>>>>> + if ((cb->dnd_list.head != nullptr)) { >>>>>>>> + AVND_DND_MSG_LIST *rec = 0; >>>>>>>> + bool found = true; >>>>>>>> + while (found) { >>>>>>>> + found = false; >>>>>>>> + for (rec = cb->dnd_list.head; rec != nullptr; >>>>>>>> + rec = rec->next) { >>>>>>>> + osafassert(rec->msg.type == AVND_MSG_AVD); >>>>>>>> + // delete all pending messages that haven't been sent out >>>>>>>> + if (rec->no_retries == 0) { >>>>>>>> + m_AVND_DIQ_REC_POP(cb, rec); >>>>>>>> + avnd_diq_rec_del(cb, rec); >>>>>>>> + break; >>>>>>>> + } else { >>>>>>>> + // Assignment response had been sent, but not ack >>>>>>>> because last >>>>>>>> + // controller go down, reset msg_id and will be >>>>>>>> resent later >>>>>>>> + if (rec->msg.info.avd->msg_type == >>>>>>>> AVSV_N2D_INFO_SU_SI_ASSIGN_MSG) { >>>>>>>> + if >>>>>>>> (rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id != 0) { >>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id = 0; >>>>>>>> + found = true; >>>>>>>> + LOG_NO( >>>>>>>> + "Found not-ack su_si_assign msg for SU:'%s', " >>>>>>>> + "SI:'%s', ha_state:'%u', msg_act:'%u', >>>>>>>> single_csi:'%u', " >>>>>>>> + "error:'%u', msg_id:'%u'", >>>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info >>>>>>>> + .n2d_su_si_assign.su_name), >>>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info >>>>>>>> + .n2d_su_si_assign.si_name), >>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.ha_state, >>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_act, >>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign >>>>>>>> + .single_csi, >>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.error, >>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id); >>>>>>>> + } >>>>>>>> + } else { >>>>>>>> + // delete other messages for now >>>>>>>> + m_AVND_DIQ_REC_POP(cb, rec); >>>>>>>> + avnd_diq_rec_del(cb, rec); >>>>>>>> + break; >>>>>>>> + } >>>>>>>> + } >>>>>>>> - /* delete the record */ >>>>>>>> - avnd_diq_rec_del(cb, rec); >>>>>>>> - } while (1); >>>>>>>> + } >>>>>>>> + } >>>>>>>> + } >>>>>>>> return; >>>>>>>> } >>>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> >>>>>>> >>>>>>> Check out the vibrant tech community on one of the world's most >>>>>>> engaging tech sites, Slashdot.org! >>>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=Luyb_FCgTEXSpVle_diQMuhKxVmmm6cmv5VA03k0Zu8&e= >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Opensaf-devel mailing list >>>>>>> Opensaf-devel@lists.sourceforge.net >>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Ddevel&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=e4sg0J1cdg4VnTqeWPDrNZlPv2BuIuFj4Dk7JACxgx8&e= >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel