Hi Minh, In the description of the ticket there is a log which is : " Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO Assigned 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO avnd_di_susi_resp_send() deferred as AMF director is offline " Last line in above log means AMFND was sending the message when it new about SC absence state. I think this issue is already fixed during #1725 and this published patch is not required. Why? After led set message amfnd will anyway send this message.
The logs that I have attached can be ignored. I was simulating the bug on different assumptions. One question regarding the patch: If the goal is to fix the issue when the message is being sent and system has become SC-less. In this situation, then avnd_mds_send() will return, most probably, a failure as MDS will not find the destination. In mds failure case, rec->no_retries will not be incremented and will remain zero. Now AMFND will process down of SC and it will call avnd_diq_del(). In this function, since no_retries is zero for this message(first message), the message will be deleted. Thanks, Praveen On 18-May-17 9:14 AM, minh chau wrote: > Hi Praveen, > > Do you have any idea why @is_avd_down was false that made amfnd to send > susi_resp at 12:37:20.453974? > It should be true by the end of avnd_evt_mds_avd_dn_evh() at > 12:37:16.741518, is it right? > > Thanks, > Minh > On 17/05/17 21:31, minh chau wrote: >> Hi Praveen, >> >> Thanks for looking at the issue. >> Here is what I am observing >> >> amfnd-PL3 received NCSMDS_DOWN indicating no active amfd >> >> May 17 12:37:16.741308 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0629] >> >> avnd_evt_mds_avd_dn_evh >> May 17 12:37:16.741342 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0651] >> WA AMF director unexpectedly crashed >> May 17 12:37:16.741354 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0701] >> TR Delete all pending messages to be sent to AMFD >> May 17 12:37:16.741379 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0709] >> NO Checking 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages >> May 17 12:37:16.741405 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0709] >> NO Checking 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' for pending >> messages >> May 17 12:37:16.741430 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0709] >> NO Checking 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' for pending >> messages >> May 17 12:37:16.741505 osafamfnd [8141:8141:src/amf/amfnd/tmr.cc:0083] >> TR SC absence timer started >> May 17 12:37:16.741518 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0742] >> << avnd_evt_mds_avd_dn_evh >> >> But a bit later, susi got assigned, and amfnd-PL3 did send this susi >> response (it should not send out and buffer it, since the @is_avd_down >> should be true) >> >> May 17 12:37:20.453974 osafamfnd [8141:8141:src/amf/amfnd/di.cc:0866] >> >> avnd_di_susi_resp_send: Sending Resp >> su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, >> si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1 >> ... >> May 17 12:37:20.454083 osafamfnd [8141:8141:src/amf/amfnd/mds.cc:1482] >> >> avnd_mds_send: Msg type '1' >> May 17 12:37:20.454244 osafamfnd [8141:8141:src/amf/amfnd/mds.cc:1537] >> ER ncsmds_api for 0 FAILED, dest=0 >> >> When SC1 restarted, amfd received the very first messages from PL3 >> starting with msg_id=1 (it should be starting from 0) >> >> May 17 12:37:28.398633 osafamfd >> [7686:7686:src/amf/amfd/ndproc.cc:0330] NO Receive message with event >> type:12, msg_type:31, from node:2030f, msg_id:1 >> May 17 12:37:28.413018 osafamfd [7686:7686:src/amf/amfd/ndfsm.cc:0334] >> NO Received node_up_msg from all nodes >> May 17 12:37:28.413069 osafamfd [7686:7686:src/amf/amfd/ndfsm.cc:0254] >> NO Received node_up from 2030f: msg_id 2 >> >> Looks to me something should not happen inside >> avnd_evt_mds_avd_dn_evh(). In this avnd_evt_mds_avd_dn_evh(), >> @is_avd_down should be true, the msg counter should be reset to 0, but >> I do see the SC absence timer started. I couldn't figure how it >> happened for now >> >> Thanks, >> Minh >> >> On 17/05/17 20:03, praveen malviya wrote: >>> What I see is avnd_diq_del() is called as soon as system becomes >>> headless. This will delete all pending messages. But when component >>> will respond during SCs absence a new message will be generated and >>> buffered. >>> For node_up AMFD will ack the message, but amfnd calls >>> avnd_diq_rec_del() (not avnd_diq_del()) in avnd_di_msg_ack_process(). >>> We need to call avnd_diq_del() in ack message so that msg_id gets >>> updated. >>> Further looking into it.. >>> >>> >>> Thanks. >>> Praveen >>> >>> >>> >>> On 17-May-17 1:50 PM, praveen malviya wrote: >>>> Hi Minh, >>>> >>>> While testing this, I am observing that amfd is dropping the assignment >>>> message because of message id mismatch: >>>> May 17 12:37:39.522117 osafamfd [7686:7686:src/amf/amfd/sgproc.cc:1171] >>>> >> avd_su_si_assign_evh: id:1, node:2030f, act:5, >>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', '', ha:3, err:1, single:0 >>>> .... >>>> .... >>>> May 17 12:37:39.522404 osafamfd [7686:7686:src/amf/amfd/ndproc.cc:0075] >>>> WA avd_msg_sanity_chk: invalid msg id 1, msg type 5, from 2030f >>>> should be 3 >>>> May 17 12:37:39.522418 osafamfd [7686:7686:src/amf/amfd/sgproc.cc:1777] >>>> << avd_su_si_assign_evh >>>> >>>> I am also looking into this. For your reference I had attached amfd and >>>> amfnd traces from SC-1 and PL-3 respectively in the ticket. >>>> I am working with one controller and one payload. >>>> >>>> >>>> Thanks >>>> Praveen >>>> >>>> On 15-May-17 1:06 PM, Minh Chau wrote: >>>>> When amfnd-payload responds susi assignment response just before >>>>> both SC >>>>> go down, and that response message does not come to director. >>>>> Therefore, >>>>> the status of that assignment could be seen as "modifying" in IMM. >>>>> When >>>>> SC comes back, active amfd will be waiting for that response forever. >>>>> >>>>> Patch checks if a susi assignment response is sent but not-ack just >>>>> before >>>>> both SC come down, amfnd-payload will buffer it in a way as a susi get >>>>> assigned during SC absence >>>>> --- >>>>> src/amf/amfnd/di.cc | 53 >>>>> +++++++++++++++++++++++++++++++++++++++++++++-------- >>>>> 1 file changed, 45 insertions(+), 8 deletions(-) >>>>> >>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc >>>>> index e06b9260d..3776a09dc 100644 >>>>> --- a/src/amf/amfnd/di.cc >>>>> +++ b/src/amf/amfnd/di.cc >>>>> @@ -1282,16 +1282,53 @@ void avnd_di_msg_ack_process(AVND_CB *cb, >>>>> uint32_t mid) { >>>>> Notes : None. >>>>> ******************************************************************************/ >>>>> >>>>> >>>>> void avnd_diq_del(AVND_CB *cb) { >>>>> - AVND_DND_MSG_LIST *rec = 0; >>>>> - do { >>>>> - /* pop the record */ >>>>> - m_AVND_DIQ_REC_POP(cb, rec); >>>>> - if (!rec) break; >>>>> + if ((cb->dnd_list.head != nullptr)) { >>>>> + AVND_DND_MSG_LIST *rec = 0; >>>>> + bool found = true; >>>>> + while (found) { >>>>> + found = false; >>>>> + for (rec = cb->dnd_list.head; rec != nullptr; >>>>> + rec = rec->next) { >>>>> + osafassert(rec->msg.type == AVND_MSG_AVD); >>>>> + // delete all pending messages that haven't been sent out >>>>> + if (rec->no_retries == 0) { >>>>> + m_AVND_DIQ_REC_POP(cb, rec); >>>>> + avnd_diq_rec_del(cb, rec); >>>>> + break; >>>>> + } else { >>>>> + // Assignment response had been sent, but not ack >>>>> because last >>>>> + // controller go down, reset msg_id and will be resent >>>>> later >>>>> + if (rec->msg.info.avd->msg_type == >>>>> AVSV_N2D_INFO_SU_SI_ASSIGN_MSG) { >>>>> + if >>>>> (rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id != 0) { >>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id = 0; >>>>> + found = true; >>>>> + LOG_NO( >>>>> + "Found not-ack su_si_assign msg for SU:'%s', " >>>>> + "SI:'%s', ha_state:'%u', msg_act:'%u', >>>>> single_csi:'%u', " >>>>> + "error:'%u', msg_id:'%u'", >>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info >>>>> + .n2d_su_si_assign.su_name), >>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info >>>>> + .n2d_su_si_assign.si_name), >>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.ha_state, >>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_act, >>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign >>>>> + .single_csi, >>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.error, >>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id); >>>>> + } >>>>> + } else { >>>>> + // delete other messages for now >>>>> + m_AVND_DIQ_REC_POP(cb, rec); >>>>> + avnd_diq_rec_del(cb, rec); >>>>> + break; >>>>> + } >>>>> + } >>>>> - /* delete the record */ >>>>> - avnd_diq_rec_del(cb, rec); >>>>> - } while (1); >>>>> + } >>>>> + } >>>>> + } >>>>> return; >>>>> } >>>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> >>>> >>>> Check out the vibrant tech community on one of the world's most >>>> engaging tech sites, Slashdot.org! >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=Luyb_FCgTEXSpVle_diQMuhKxVmmm6cmv5VA03k0Zu8&e= >>>> >>>> >>>> _______________________________________________ >>>> Opensaf-devel mailing list >>>> Opensaf-devel@lists.sourceforge.net >>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Ddevel&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=e4sg0J1cdg4VnTqeWPDrNZlPv2BuIuFj4Dk7JACxgx8&e= >>>> >>>> >>>> >>> >> > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel