Hi Minh, I had analysed the traces you attached. Based on that I am able to test that. When MDS returns success patch works fine. Minor correction is needed when MDS return failure. I think susi message should be kept independent of no. of tries in avnd_diq_del(). Thanks Praveen
On 18-May-17 12:41 PM, minh chau wrote: > Hi Praveen, > > Some comments in line with [Minh] > > thanks, > Minh > > On 18/05/17 14:54, praveen malviya wrote: >> Hi Minh, >> >> In the description of the ticket there is a log which is : >> " >> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO Assigned >> 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to >> 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' >> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO >> avnd_di_susi_resp_send() deferred as AMF director is offline >> " >> Last line in above log means AMFND was sending the message when it new >> about SC absence state. I think this issue is already fixed during >> #1725 and this published patch is not required. Why? After led set >> message amfnd will anyway send this message. > [Minh] I have reproduced the problem and attached to ticket for your > reference. > Some outlined logs: > The step is stopping SC1, SC2. > In SC2, amfd sent susi assignment req to amfnd-PL3 > May 18 16:32:03.633226 osafamfd [245:245:src/amf/amfd/sgproc.cc:2444] >> > avd_sg_su_si_mod_snd: 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon', > state 1 > > In PL3, amfnd completed this susi req, and sent susi resp successfully > but it did not reach to amfd-SC2 > May 18 16:32:03.641156 osafamfnd [186:186:src/amf/amfnd/su.cc:0373] >> > avnd_evt_avd_info_su_si_assign_evh: > 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon' > May 18 16:32:03.641744 osafamfnd [186:186:src/amf/amfnd/di.cc:0866] >> > avnd_di_susi_resp_send: Sending Resp > su=safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon, > si=safSi=AmfDemoTwon,safApp=AmfDemoTwon, curr_state=1, prv_state=2 > > amfnd-PL3 is notified NCSMDS_DOWN, amfnd deleted all pending msg waiting > for ack > May 18 16:32:05.568471 osafamfnd [186:186:src/amf/amfnd/di.cc:0629] >> > avnd_evt_mds_avd_dn_evh > May 18 16:32:05.568492 osafamfnd [186:186:src/amf/amfnd/di.cc:0651] WA > AMF director unexpectedly crashed > May 18 16:32:05.568495 osafamfnd [186:186:src/amf/amfnd/di.cc:0701] TR > Delete all pending messages to be sent to AMFD > May 18 16:32:05.568498 osafamfnd [186:186:src/amf/amfnd/di.cc:1353] >> > avnd_diq_rec_del > May 18 16:32:05.568503 osafamfnd [186:186:src/amf/amfnd/di.cc:1369] << > avnd_diq_rec_del > > When SC restarts, amfd-SC1 thinks this assignment being in progress, so > it waits and waits forever > May 18 16:32:28.954967 osafamfd [257:257:src/amf/amfd/su.cc:2588] >> > any_susi_fsm_in: SU:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon', > check_fsm:5 > May 18 16:32:28.954975 osafamfd [257:257:src/amf/amfd/su.cc:2593] TR > SUSI:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon,safSi=AmfDemoTwon,safApp=AmfDemoTwon', > > fsm:'5' > May 18 16:32:28.954982 osafamfd [257:257:src/amf/amfd/su.cc:2596] TR Found > May 18 16:32:28.954989 osafamfd [257:257:src/amf/amfd/su.cc:2599] << > any_susi_fsm_in > May 18 16:32:28.954996 osafamfd [257:257:src/amf/amfd/sg.cc:2340] << > any_assignment_in_progress > > This problem is very close to the one you mentioned and fixed in #1725. > In #1725, amfnd surely knows amfd down, so amfnd buffers msg. In #2105, > amfnd sends msg out just before amfnd detects amfd being down. >> >> The logs that I have attached can be ignored. I was simulating the bug >> on different assumptions. >> >> One question regarding the patch: >> If the goal is to fix the issue when the message is being sent and >> system has become SC-less. In this situation, then avnd_mds_send() >> will return, most probably, a failure as MDS will not find the >> destination. In mds failure case, rec->no_retries will not be >> incremented and will remain zero. Now AMFND will process down of SC >> and it will call avnd_diq_del(). In this function, since no_retries is >> zero for this message(first message), the message will be deleted. >> > [Minh]: Thanks, it's good to handle failure code returned from MDS. I > will update the patch >> >> Thanks, >> Praveen >> >> >> On 18-May-17 9:14 AM, minh chau wrote: >>> Hi Praveen, >>> >>> Do you have any idea why @is_avd_down was false that made amfnd to >>> send susi_resp at 12:37:20.453974? >>> It should be true by the end of avnd_evt_mds_avd_dn_evh() at >>> 12:37:16.741518, is it right? >>> >>> Thanks, >>> Minh >>> On 17/05/17 21:31, minh chau wrote: >>>> Hi Praveen, >>>> >>>> Thanks for looking at the issue. >>>> Here is what I am observing >>>> >>>> amfnd-PL3 received NCSMDS_DOWN indicating no active amfd >>>> >>>> May 17 12:37:16.741308 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0629] >> avnd_evt_mds_avd_dn_evh >>>> May 17 12:37:16.741342 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0651] WA AMF director unexpectedly >>>> crashed >>>> May 17 12:37:16.741354 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0701] TR Delete all pending messages >>>> to be sent to AMFD >>>> May 17 12:37:16.741379 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking >>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages >>>> May 17 12:37:16.741405 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking >>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' for pending messages >>>> May 17 12:37:16.741430 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking >>>> 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' for pending messages >>>> May 17 12:37:16.741505 osafamfnd >>>> [8141:8141:src/amf/amfnd/tmr.cc:0083] TR SC absence timer started >>>> May 17 12:37:16.741518 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0742] << avnd_evt_mds_avd_dn_evh >>>> >>>> But a bit later, susi got assigned, and amfnd-PL3 did send this susi >>>> response (it should not send out and buffer it, since the >>>> @is_avd_down should be true) >>>> >>>> May 17 12:37:20.453974 osafamfnd >>>> [8141:8141:src/amf/amfnd/di.cc:0866] >> avnd_di_susi_resp_send: >>>> Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, >>>> si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1 >>>> ... >>>> May 17 12:37:20.454083 osafamfnd >>>> [8141:8141:src/amf/amfnd/mds.cc:1482] >> avnd_mds_send: Msg type '1' >>>> May 17 12:37:20.454244 osafamfnd >>>> [8141:8141:src/amf/amfnd/mds.cc:1537] ER ncsmds_api for 0 FAILED, >>>> dest=0 >>>> >>>> When SC1 restarted, amfd received the very first messages from PL3 >>>> starting with msg_id=1 (it should be starting from 0) >>>> >>>> May 17 12:37:28.398633 osafamfd >>>> [7686:7686:src/amf/amfd/ndproc.cc:0330] NO Receive message with >>>> event type:12, msg_type:31, from node:2030f, msg_id:1 >>>> May 17 12:37:28.413018 osafamfd >>>> [7686:7686:src/amf/amfd/ndfsm.cc:0334] NO Received node_up_msg from >>>> all nodes >>>> May 17 12:37:28.413069 osafamfd >>>> [7686:7686:src/amf/amfd/ndfsm.cc:0254] NO Received node_up from >>>> 2030f: msg_id 2 >>>> >>>> Looks to me something should not happen inside >>>> avnd_evt_mds_avd_dn_evh(). In this avnd_evt_mds_avd_dn_evh(), >>>> @is_avd_down should be true, the msg counter should be reset to 0, >>>> but I do see the SC absence timer started. I couldn't figure how it >>>> happened for now >>>> >>>> Thanks, >>>> Minh >>>> >>>> On 17/05/17 20:03, praveen malviya wrote: >>>>> What I see is avnd_diq_del() is called as soon as system becomes >>>>> headless. This will delete all pending messages. But when component >>>>> will respond during SCs absence a new message will be generated and >>>>> buffered. >>>>> For node_up AMFD will ack the message, but amfnd calls >>>>> avnd_diq_rec_del() (not avnd_diq_del()) in avnd_di_msg_ack_process(). >>>>> We need to call avnd_diq_del() in ack message so that msg_id gets >>>>> updated. >>>>> Further looking into it.. >>>>> >>>>> >>>>> Thanks. >>>>> Praveen >>>>> >>>>> >>>>> >>>>> On 17-May-17 1:50 PM, praveen malviya wrote: >>>>>> Hi Minh, >>>>>> >>>>>> While testing this, I am observing that amfd is dropping the >>>>>> assignment >>>>>> message because of message id mismatch: >>>>>> May 17 12:37:39.522117 osafamfd >>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1171] >>>>>> >> avd_su_si_assign_evh: id:1, node:2030f, act:5, >>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', '', ha:3, err:1, single:0 >>>>>> .... >>>>>> .... >>>>>> May 17 12:37:39.522404 osafamfd >>>>>> [7686:7686:src/amf/amfd/ndproc.cc:0075] >>>>>> WA avd_msg_sanity_chk: invalid msg id 1, msg type 5, from 2030f >>>>>> should be 3 >>>>>> May 17 12:37:39.522418 osafamfd >>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1777] >>>>>> << avd_su_si_assign_evh >>>>>> >>>>>> I am also looking into this. For your reference I had attached >>>>>> amfd and >>>>>> amfnd traces from SC-1 and PL-3 respectively in the ticket. >>>>>> I am working with one controller and one payload. >>>>>> >>>>>> >>>>>> Thanks >>>>>> Praveen >>>>>> >>>>>> On 15-May-17 1:06 PM, Minh Chau wrote: >>>>>>> When amfnd-payload responds susi assignment response just before >>>>>>> both SC >>>>>>> go down, and that response message does not come to director. >>>>>>> Therefore, >>>>>>> the status of that assignment could be seen as "modifying" in >>>>>>> IMM. When >>>>>>> SC comes back, active amfd will be waiting for that response >>>>>>> forever. >>>>>>> >>>>>>> Patch checks if a susi assignment response is sent but not-ack >>>>>>> just before >>>>>>> both SC come down, amfnd-payload will buffer it in a way as a >>>>>>> susi get >>>>>>> assigned during SC absence >>>>>>> --- >>>>>>> src/amf/amfnd/di.cc | 53 >>>>>>> +++++++++++++++++++++++++++++++++++++++++++++-------- >>>>>>> 1 file changed, 45 insertions(+), 8 deletions(-) >>>>>>> >>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc >>>>>>> index e06b9260d..3776a09dc 100644 >>>>>>> --- a/src/amf/amfnd/di.cc >>>>>>> +++ b/src/amf/amfnd/di.cc >>>>>>> @@ -1282,16 +1282,53 @@ void avnd_di_msg_ack_process(AVND_CB *cb, >>>>>>> uint32_t mid) { >>>>>>> Notes : None. >>>>>>> ******************************************************************************/ >>>>>>> >>>>>>> >>>>>>> void avnd_diq_del(AVND_CB *cb) { >>>>>>> - AVND_DND_MSG_LIST *rec = 0; >>>>>>> - do { >>>>>>> - /* pop the record */ >>>>>>> - m_AVND_DIQ_REC_POP(cb, rec); >>>>>>> - if (!rec) break; >>>>>>> + if ((cb->dnd_list.head != nullptr)) { >>>>>>> + AVND_DND_MSG_LIST *rec = 0; >>>>>>> + bool found = true; >>>>>>> + while (found) { >>>>>>> + found = false; >>>>>>> + for (rec = cb->dnd_list.head; rec != nullptr; >>>>>>> + rec = rec->next) { >>>>>>> + osafassert(rec->msg.type == AVND_MSG_AVD); >>>>>>> + // delete all pending messages that haven't been sent out >>>>>>> + if (rec->no_retries == 0) { >>>>>>> + m_AVND_DIQ_REC_POP(cb, rec); >>>>>>> + avnd_diq_rec_del(cb, rec); >>>>>>> + break; >>>>>>> + } else { >>>>>>> + // Assignment response had been sent, but not ack >>>>>>> because last >>>>>>> + // controller go down, reset msg_id and will be resent >>>>>>> later >>>>>>> + if (rec->msg.info.avd->msg_type == >>>>>>> AVSV_N2D_INFO_SU_SI_ASSIGN_MSG) { >>>>>>> + if >>>>>>> (rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id != 0) { >>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id = 0; >>>>>>> + found = true; >>>>>>> + LOG_NO( >>>>>>> + "Found not-ack su_si_assign msg for SU:'%s', " >>>>>>> + "SI:'%s', ha_state:'%u', msg_act:'%u', >>>>>>> single_csi:'%u', " >>>>>>> + "error:'%u', msg_id:'%u'", >>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info >>>>>>> + .n2d_su_si_assign.su_name), >>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info >>>>>>> + .n2d_su_si_assign.si_name), >>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.ha_state, >>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_act, >>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign >>>>>>> + .single_csi, >>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.error, >>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id); >>>>>>> + } >>>>>>> + } else { >>>>>>> + // delete other messages for now >>>>>>> + m_AVND_DIQ_REC_POP(cb, rec); >>>>>>> + avnd_diq_rec_del(cb, rec); >>>>>>> + break; >>>>>>> + } >>>>>>> + } >>>>>>> - /* delete the record */ >>>>>>> - avnd_diq_rec_del(cb, rec); >>>>>>> - } while (1); >>>>>>> + } >>>>>>> + } >>>>>>> + } >>>>>>> return; >>>>>>> } >>>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> >>>>>> >>>>>> Check out the vibrant tech community on one of the world's most >>>>>> engaging tech sites, Slashdot.org! >>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=Luyb_FCgTEXSpVle_diQMuhKxVmmm6cmv5VA03k0Zu8&e= >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Opensaf-devel mailing list >>>>>> Opensaf-devel@lists.sourceforge.net >>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Ddevel&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=e4sg0J1cdg4VnTqeWPDrNZlPv2BuIuFj4Dk7JACxgx8&e= >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel