Hi Praveen,

I have sent out V2:
- If we don't care no_retries in avnd_diq_del(), then I think checking 
mds failure code can be ignored
- Add checking for su_oper msg, since amfd also waits for this message 
to orchestrate recovery

Thanks,
Minh

On 18/05/17 19:38, praveen malviya wrote:
> Hi Minh,
>
> I had analysed the traces you attached.
> Based on that I am able to test that. When MDS returns success patch 
> works fine.
> Minor correction is needed when MDS return failure.
> I think susi message should be kept independent of no. of tries in 
> avnd_diq_del().
> Thanks
> Praveen
>
>
> On 18-May-17 12:41 PM, minh chau wrote:
>> Hi Praveen,
>>
>> Some comments in line with [Minh]
>>
>> thanks,
>> Minh
>>
>> On 18/05/17 14:54, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> In the description of the ticket there is a log which is :
>>> "
>>> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO Assigned 
>>> 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to 
>>> 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
>>> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO 
>>> avnd_di_susi_resp_send() deferred as AMF director is offline
>>> "
>>> Last line in above log means AMFND was sending the message when it 
>>> new about SC absence state. I think this issue is already fixed 
>>> during #1725 and this published patch is not required. Why? After 
>>> led set message amfnd will anyway send this message.
>> [Minh] I have reproduced the problem and attached to ticket for your 
>> reference.
>> Some outlined logs:
>> The step is stopping SC1, SC2.
>> In SC2, amfd sent susi assignment req to amfnd-PL3
>> May 18 16:32:03.633226 osafamfd [245:245:src/amf/amfd/sgproc.cc:2444] 
>> >> avd_sg_su_si_mod_snd: 
>> 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon', state 1
>>
>> In PL3, amfnd completed this susi req, and sent susi resp 
>> successfully but it did not reach to amfd-SC2
>> May 18 16:32:03.641156 osafamfnd [186:186:src/amf/amfnd/su.cc:0373] 
>> >> avnd_evt_avd_info_su_si_assign_evh: 
>> 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon'
>> May 18 16:32:03.641744 osafamfnd [186:186:src/amf/amfnd/di.cc:0866] 
>> >> avnd_di_susi_resp_send: Sending Resp 
>> su=safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon, 
>> si=safSi=AmfDemoTwon,safApp=AmfDemoTwon, curr_state=1, prv_state=2
>>
>> amfnd-PL3 is notified NCSMDS_DOWN, amfnd deleted all pending msg 
>> waiting for ack
>> May 18 16:32:05.568471 osafamfnd [186:186:src/amf/amfnd/di.cc:0629] 
>> >> avnd_evt_mds_avd_dn_evh
>> May 18 16:32:05.568492 osafamfnd [186:186:src/amf/amfnd/di.cc:0651] 
>> WA AMF director unexpectedly crashed
>> May 18 16:32:05.568495 osafamfnd [186:186:src/amf/amfnd/di.cc:0701] 
>> TR Delete all pending messages to be sent to AMFD
>> May 18 16:32:05.568498 osafamfnd [186:186:src/amf/amfnd/di.cc:1353] 
>> >> avnd_diq_rec_del
>> May 18 16:32:05.568503 osafamfnd [186:186:src/amf/amfnd/di.cc:1369] 
>> << avnd_diq_rec_del
>>
>> When SC restarts, amfd-SC1 thinks this assignment being in progress, 
>> so it waits and waits forever
>> May 18 16:32:28.954967 osafamfd [257:257:src/amf/amfd/su.cc:2588] >> 
>> any_susi_fsm_in: SU:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
>> check_fsm:5
>> May 18 16:32:28.954975 osafamfd [257:257:src/amf/amfd/su.cc:2593] TR 
>> SUSI:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon,safSi=AmfDemoTwon,safApp=AmfDemoTwon',
>>  
>> fsm:'5'
>> May 18 16:32:28.954982 osafamfd [257:257:src/amf/amfd/su.cc:2596] TR 
>> Found
>> May 18 16:32:28.954989 osafamfd [257:257:src/amf/amfd/su.cc:2599] << 
>> any_susi_fsm_in
>> May 18 16:32:28.954996 osafamfd [257:257:src/amf/amfd/sg.cc:2340] << 
>> any_assignment_in_progress
>>
>> This problem is very close to the one you mentioned and fixed in 
>> #1725. In #1725, amfnd surely knows amfd down, so amfnd buffers msg. 
>> In #2105, amfnd sends msg out just before amfnd detects amfd being down.
>>>
>>> The logs that I have attached can be ignored. I was simulating the 
>>> bug on different assumptions.
>>>
>>> One question regarding the patch:
>>> If the goal is to fix the issue when the message is being sent and 
>>> system has become SC-less. In this situation, then avnd_mds_send() 
>>> will return, most probably,  a failure as MDS will not find the 
>>> destination. In mds failure case, rec->no_retries will not be 
>>> incremented and will remain zero. Now AMFND will process down of SC 
>>> and it will call avnd_diq_del(). In this function, since no_retries 
>>> is zero for this message(first message),  the message will be deleted.
>>>
>> [Minh]: Thanks, it's good to handle failure code returned from MDS. I 
>> will update the patch
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 18-May-17 9:14 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> Do you have any idea why @is_avd_down was false that made amfnd to 
>>>> send susi_resp at 12:37:20.453974?
>>>> It should be true by the end of avnd_evt_mds_avd_dn_evh() at 
>>>> 12:37:16.741518, is it right?
>>>>
>>>> Thanks,
>>>> Minh
>>>> On 17/05/17 21:31, minh chau wrote:
>>>>> Hi Praveen,
>>>>>
>>>>> Thanks for looking at the issue.
>>>>> Here is what I am observing
>>>>>
>>>>> amfnd-PL3 received NCSMDS_DOWN indicating no active amfd
>>>>>
>>>>> May 17 12:37:16.741308 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0629] >> avnd_evt_mds_avd_dn_evh
>>>>> May 17 12:37:16.741342 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0651] WA AMF director unexpectedly 
>>>>> crashed
>>>>> May 17 12:37:16.741354 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0701] TR Delete all pending 
>>>>> messages to be sent to AMFD
>>>>> May 17 12:37:16.741379 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking 
>>>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages
>>>>> May 17 12:37:16.741405 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking 
>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' for pending messages
>>>>> May 17 12:37:16.741430 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking 
>>>>> 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' for pending messages
>>>>> May 17 12:37:16.741505 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/tmr.cc:0083] TR SC absence timer started
>>>>> May 17 12:37:16.741518 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0742] << avnd_evt_mds_avd_dn_evh
>>>>>
>>>>> But a bit later, susi got assigned, and amfnd-PL3 did send this 
>>>>> susi response (it should not send out and buffer it, since the 
>>>>> @is_avd_down should be true)
>>>>>
>>>>> May 17 12:37:20.453974 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/di.cc:0866] >> avnd_di_susi_resp_send: 
>>>>> Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, 
>>>>> si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1
>>>>> ...
>>>>> May 17 12:37:20.454083 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/mds.cc:1482] >> avnd_mds_send: Msg type '1'
>>>>> May 17 12:37:20.454244 osafamfnd 
>>>>> [8141:8141:src/amf/amfnd/mds.cc:1537] ER ncsmds_api for 0 FAILED, 
>>>>> dest=0
>>>>>
>>>>> When SC1 restarted, amfd received the very first messages from PL3 
>>>>> starting with msg_id=1 (it should be starting from 0)
>>>>>
>>>>> May 17 12:37:28.398633 osafamfd 
>>>>> [7686:7686:src/amf/amfd/ndproc.cc:0330] NO Receive message with 
>>>>> event type:12, msg_type:31, from node:2030f, msg_id:1
>>>>> May 17 12:37:28.413018 osafamfd 
>>>>> [7686:7686:src/amf/amfd/ndfsm.cc:0334] NO Received node_up_msg 
>>>>> from all nodes
>>>>> May 17 12:37:28.413069 osafamfd 
>>>>> [7686:7686:src/amf/amfd/ndfsm.cc:0254] NO Received node_up from 
>>>>> 2030f: msg_id 2
>>>>>
>>>>> Looks to me something should not happen inside 
>>>>> avnd_evt_mds_avd_dn_evh(). In this avnd_evt_mds_avd_dn_evh(), 
>>>>> @is_avd_down should be true, the msg counter should be reset to 0, 
>>>>> but I do see the SC absence timer started. I couldn't figure how 
>>>>> it happened for now
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>> On 17/05/17 20:03, praveen malviya wrote:
>>>>>> What I see is avnd_diq_del() is called as soon as system becomes 
>>>>>> headless. This will delete all pending messages. But when 
>>>>>> component will respond during SCs absence a new message will be 
>>>>>> generated and buffered.
>>>>>> For node_up AMFD will ack the message, but amfnd calls 
>>>>>> avnd_diq_rec_del() (not avnd_diq_del()) in 
>>>>>> avnd_di_msg_ack_process().
>>>>>> We need to call avnd_diq_del() in ack message so that msg_id gets 
>>>>>> updated.
>>>>>> Further looking into it..
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>> Praveen
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17-May-17 1:50 PM, praveen malviya wrote:
>>>>>>> Hi Minh,
>>>>>>>
>>>>>>> While testing this, I am observing that amfd is dropping the 
>>>>>>> assignment
>>>>>>> message because of message id mismatch:
>>>>>>> May 17 12:37:39.522117 osafamfd 
>>>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1171]
>>>>>>>   >> avd_su_si_assign_evh: id:1, node:2030f, act:5,
>>>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', '', ha:3, err:1, 
>>>>>>> single:0
>>>>>>> ....
>>>>>>> ....
>>>>>>> May 17 12:37:39.522404 osafamfd 
>>>>>>> [7686:7686:src/amf/amfd/ndproc.cc:0075]
>>>>>>> WA avd_msg_sanity_chk: invalid msg id 1, msg type 5, from 2030f 
>>>>>>> should be 3
>>>>>>> May 17 12:37:39.522418 osafamfd 
>>>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1777]
>>>>>>> << avd_su_si_assign_evh
>>>>>>>
>>>>>>> I am also looking into this. For your reference I had attached 
>>>>>>> amfd and
>>>>>>> amfnd traces from SC-1 and PL-3 respectively in the ticket.
>>>>>>> I am working with one controller and one payload.
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Praveen
>>>>>>>
>>>>>>> On 15-May-17 1:06 PM, Minh Chau wrote:
>>>>>>>> When amfnd-payload responds susi assignment response just 
>>>>>>>> before both SC
>>>>>>>> go down, and that response message does not come to director. 
>>>>>>>> Therefore,
>>>>>>>> the status of that assignment could be seen as "modifying" in 
>>>>>>>> IMM. When
>>>>>>>> SC comes back, active amfd will be waiting for that response 
>>>>>>>> forever.
>>>>>>>>
>>>>>>>> Patch checks if a susi assignment response is sent but not-ack 
>>>>>>>> just before
>>>>>>>> both SC come down, amfnd-payload will buffer it in a way as a 
>>>>>>>> susi get
>>>>>>>> assigned during SC absence
>>>>>>>> ---
>>>>>>>>    src/amf/amfnd/di.cc | 53 
>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>>>>>>>    1 file changed, 45 insertions(+), 8 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>>>> index e06b9260d..3776a09dc 100644
>>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>>>> @@ -1282,16 +1282,53 @@ void avnd_di_msg_ack_process(AVND_CB 
>>>>>>>> *cb, uint32_t mid) {
>>>>>>>>      Notes         : None.
>>>>>>>> ******************************************************************************/
>>>>>>>>  
>>>>>>>>
>>>>>>>>    void avnd_diq_del(AVND_CB *cb) {
>>>>>>>> -  AVND_DND_MSG_LIST *rec = 0;
>>>>>>>>    -  do {
>>>>>>>> -    /* pop the record */
>>>>>>>> -    m_AVND_DIQ_REC_POP(cb, rec);
>>>>>>>> -    if (!rec) break;
>>>>>>>> +  if ((cb->dnd_list.head != nullptr)) {
>>>>>>>> +    AVND_DND_MSG_LIST *rec = 0;
>>>>>>>> +    bool found = true;
>>>>>>>> +    while (found) {
>>>>>>>> +      found = false;
>>>>>>>> +      for (rec = cb->dnd_list.head; rec != nullptr;
>>>>>>>> +           rec = rec->next) {
>>>>>>>> +        osafassert(rec->msg.type == AVND_MSG_AVD);
>>>>>>>> +        // delete all pending messages that haven't been sent out
>>>>>>>> +        if (rec->no_retries == 0) {
>>>>>>>> +          m_AVND_DIQ_REC_POP(cb, rec);
>>>>>>>> +          avnd_diq_rec_del(cb, rec);
>>>>>>>> +          break;
>>>>>>>> +        } else {
>>>>>>>> +          // Assignment response had been sent, but not ack 
>>>>>>>> because last
>>>>>>>> +          // controller go down, reset msg_id and will be 
>>>>>>>> resent later
>>>>>>>> +          if (rec->msg.info.avd->msg_type == 
>>>>>>>> AVSV_N2D_INFO_SU_SI_ASSIGN_MSG) {
>>>>>>>> +            if 
>>>>>>>> (rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id != 0) {
>>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id = 0;
>>>>>>>> +              found = true;
>>>>>>>> +              LOG_NO(
>>>>>>>> +                  "Found not-ack su_si_assign msg for SU:'%s', "
>>>>>>>> +                  "SI:'%s', ha_state:'%u', msg_act:'%u', 
>>>>>>>> single_csi:'%u', "
>>>>>>>> +                  "error:'%u', msg_id:'%u'",
>>>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info
>>>>>>>> + .n2d_su_si_assign.su_name),
>>>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info
>>>>>>>> + .n2d_su_si_assign.si_name),
>>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.ha_state,
>>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_act,
>>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign
>>>>>>>> +                      .single_csi,
>>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.error,
>>>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id);
>>>>>>>> +            }
>>>>>>>> +          } else {
>>>>>>>> +            // delete other messages for now
>>>>>>>> +            m_AVND_DIQ_REC_POP(cb, rec);
>>>>>>>> +            avnd_diq_rec_del(cb, rec);
>>>>>>>> +            break;
>>>>>>>> +          }
>>>>>>>> +        }
>>>>>>>>    -    /* delete the record */
>>>>>>>> -    avnd_diq_rec_del(cb, rec);
>>>>>>>> -  } while (1);
>>>>>>>> +      }
>>>>>>>> +    }
>>>>>>>> +  }
>>>>>>>>         return;
>>>>>>>>    }
>>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>>  
>>>>>>>
>>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>>> engaging tech sites, Slashdot.org! 
>>>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=Luyb_FCgTEXSpVle_diQMuhKxVmmm6cmv5VA03k0Zu8&e=
>>>>>>>  
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Opensaf-devel mailing list
>>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Ddevel&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=e4sg0J1cdg4VnTqeWPDrNZlPv2BuIuFj4Dk7JACxgx8&e=
>>>>>>>  
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to