Hi Praveen,
Some comments in line with [Minh]
thanks,
Minh
On 18/05/17 14:54, praveen malviya wrote:
> Hi Minh,
>
> In the description of the ticket there is a log which is :
> "
> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO Assigned
> 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to
> 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
> Oct 7 18:31:41 SYSTEST-PLD-1 osafamfnd[12467]: NO
> avnd_di_susi_resp_send() deferred as AMF director is offline
> "
> Last line in above log means AMFND was sending the message when it new
> about SC absence state. I think this issue is already fixed during
> #1725 and this published patch is not required. Why? After led set
> message amfnd will anyway send this message.
[Minh] I have reproduced the problem and attached to ticket for your
reference.
Some outlined logs:
The step is stopping SC1, SC2.
In SC2, amfd sent susi assignment req to amfnd-PL3
May 18 16:32:03.633226 osafamfd [245:245:src/amf/amfd/sgproc.cc:2444] >>
avd_sg_su_si_mod_snd: 'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
state 1
In PL3, amfnd completed this susi req, and sent susi resp successfully
but it did not reach to amfd-SC2
May 18 16:32:03.641156 osafamfnd [186:186:src/amf/amfnd/su.cc:0373] >>
avnd_evt_avd_info_su_si_assign_evh:
'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon'
May 18 16:32:03.641744 osafamfnd [186:186:src/amf/amfnd/di.cc:0866] >>
avnd_di_susi_resp_send: Sending Resp
su=safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon,
si=safSi=AmfDemoTwon,safApp=AmfDemoTwon, curr_state=1, prv_state=2
amfnd-PL3 is notified NCSMDS_DOWN, amfnd deleted all pending msg waiting
for ack
May 18 16:32:05.568471 osafamfnd [186:186:src/amf/amfnd/di.cc:0629] >>
avnd_evt_mds_avd_dn_evh
May 18 16:32:05.568492 osafamfnd [186:186:src/amf/amfnd/di.cc:0651] WA
AMF director unexpectedly crashed
May 18 16:32:05.568495 osafamfnd [186:186:src/amf/amfnd/di.cc:0701] TR
Delete all pending messages to be sent to AMFD
May 18 16:32:05.568498 osafamfnd [186:186:src/amf/amfnd/di.cc:1353] >>
avnd_diq_rec_del
May 18 16:32:05.568503 osafamfnd [186:186:src/amf/amfnd/di.cc:1369] <<
avnd_diq_rec_del
When SC restarts, amfd-SC1 thinks this assignment being in progress, so
it waits and waits forever
May 18 16:32:28.954967 osafamfd [257:257:src/amf/amfd/su.cc:2588] >>
any_susi_fsm_in: SU:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon',
check_fsm:5
May 18 16:32:28.954975 osafamfd [257:257:src/amf/amfd/su.cc:2593] TR
SUSI:'safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon,safSi=AmfDemoTwon,safApp=AmfDemoTwon',
fsm:'5'
May 18 16:32:28.954982 osafamfd [257:257:src/amf/amfd/su.cc:2596] TR Found
May 18 16:32:28.954989 osafamfd [257:257:src/amf/amfd/su.cc:2599] <<
any_susi_fsm_in
May 18 16:32:28.954996 osafamfd [257:257:src/amf/amfd/sg.cc:2340] <<
any_assignment_in_progress
This problem is very close to the one you mentioned and fixed in #1725.
In #1725, amfnd surely knows amfd down, so amfnd buffers msg. In #2105,
amfnd sends msg out just before amfnd detects amfd being down.
>
> The logs that I have attached can be ignored. I was simulating the bug
> on different assumptions.
>
> One question regarding the patch:
> If the goal is to fix the issue when the message is being sent and
> system has become SC-less. In this situation, then avnd_mds_send()
> will return, most probably, a failure as MDS will not find the
> destination. In mds failure case, rec->no_retries will not be
> incremented and will remain zero. Now AMFND will process down of SC
> and it will call avnd_diq_del(). In this function, since no_retries is
> zero for this message(first message), the message will be deleted.
>
[Minh]: Thanks, it's good to handle failure code returned from MDS. I
will update the patch
>
> Thanks,
> Praveen
>
>
> On 18-May-17 9:14 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Do you have any idea why @is_avd_down was false that made amfnd to
>> send susi_resp at 12:37:20.453974?
>> It should be true by the end of avnd_evt_mds_avd_dn_evh() at
>> 12:37:16.741518, is it right?
>>
>> Thanks,
>> Minh
>> On 17/05/17 21:31, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Thanks for looking at the issue.
>>> Here is what I am observing
>>>
>>> amfnd-PL3 received NCSMDS_DOWN indicating no active amfd
>>>
>>> May 17 12:37:16.741308 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0629] >> avnd_evt_mds_avd_dn_evh
>>> May 17 12:37:16.741342 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0651] WA AMF director unexpectedly
>>> crashed
>>> May 17 12:37:16.741354 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0701] TR Delete all pending messages
>>> to be sent to AMFD
>>> May 17 12:37:16.741379 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking
>>> 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' for pending messages
>>> May 17 12:37:16.741405 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking
>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' for pending messages
>>> May 17 12:37:16.741430 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0709] NO Checking
>>> 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' for pending messages
>>> May 17 12:37:16.741505 osafamfnd
>>> [8141:8141:src/amf/amfnd/tmr.cc:0083] TR SC absence timer started
>>> May 17 12:37:16.741518 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0742] << avnd_evt_mds_avd_dn_evh
>>>
>>> But a bit later, susi got assigned, and amfnd-PL3 did send this susi
>>> response (it should not send out and buffer it, since the
>>> @is_avd_down should be true)
>>>
>>> May 17 12:37:20.453974 osafamfnd
>>> [8141:8141:src/amf/amfnd/di.cc:0866] >> avnd_di_susi_resp_send:
>>> Sending Resp su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1,
>>> si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1
>>> ...
>>> May 17 12:37:20.454083 osafamfnd
>>> [8141:8141:src/amf/amfnd/mds.cc:1482] >> avnd_mds_send: Msg type '1'
>>> May 17 12:37:20.454244 osafamfnd
>>> [8141:8141:src/amf/amfnd/mds.cc:1537] ER ncsmds_api for 0 FAILED,
>>> dest=0
>>>
>>> When SC1 restarted, amfd received the very first messages from PL3
>>> starting with msg_id=1 (it should be starting from 0)
>>>
>>> May 17 12:37:28.398633 osafamfd
>>> [7686:7686:src/amf/amfd/ndproc.cc:0330] NO Receive message with
>>> event type:12, msg_type:31, from node:2030f, msg_id:1
>>> May 17 12:37:28.413018 osafamfd
>>> [7686:7686:src/amf/amfd/ndfsm.cc:0334] NO Received node_up_msg from
>>> all nodes
>>> May 17 12:37:28.413069 osafamfd
>>> [7686:7686:src/amf/amfd/ndfsm.cc:0254] NO Received node_up from
>>> 2030f: msg_id 2
>>>
>>> Looks to me something should not happen inside
>>> avnd_evt_mds_avd_dn_evh(). In this avnd_evt_mds_avd_dn_evh(),
>>> @is_avd_down should be true, the msg counter should be reset to 0,
>>> but I do see the SC absence timer started. I couldn't figure how it
>>> happened for now
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 17/05/17 20:03, praveen malviya wrote:
>>>> What I see is avnd_diq_del() is called as soon as system becomes
>>>> headless. This will delete all pending messages. But when component
>>>> will respond during SCs absence a new message will be generated and
>>>> buffered.
>>>> For node_up AMFD will ack the message, but amfnd calls
>>>> avnd_diq_rec_del() (not avnd_diq_del()) in avnd_di_msg_ack_process().
>>>> We need to call avnd_diq_del() in ack message so that msg_id gets
>>>> updated.
>>>> Further looking into it..
>>>>
>>>>
>>>> Thanks.
>>>> Praveen
>>>>
>>>>
>>>>
>>>> On 17-May-17 1:50 PM, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> While testing this, I am observing that amfd is dropping the
>>>>> assignment
>>>>> message because of message id mismatch:
>>>>> May 17 12:37:39.522117 osafamfd
>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1171]
>>>>> >> avd_su_si_assign_evh: id:1, node:2030f, act:5,
>>>>> 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', '', ha:3, err:1, single:0
>>>>> ....
>>>>> ....
>>>>> May 17 12:37:39.522404 osafamfd
>>>>> [7686:7686:src/amf/amfd/ndproc.cc:0075]
>>>>> WA avd_msg_sanity_chk: invalid msg id 1, msg type 5, from 2030f
>>>>> should be 3
>>>>> May 17 12:37:39.522418 osafamfd
>>>>> [7686:7686:src/amf/amfd/sgproc.cc:1777]
>>>>> << avd_su_si_assign_evh
>>>>>
>>>>> I am also looking into this. For your reference I had attached
>>>>> amfd and
>>>>> amfnd traces from SC-1 and PL-3 respectively in the ticket.
>>>>> I am working with one controller and one payload.
>>>>>
>>>>>
>>>>> Thanks
>>>>> Praveen
>>>>>
>>>>> On 15-May-17 1:06 PM, Minh Chau wrote:
>>>>>> When amfnd-payload responds susi assignment response just before
>>>>>> both SC
>>>>>> go down, and that response message does not come to director.
>>>>>> Therefore,
>>>>>> the status of that assignment could be seen as "modifying" in
>>>>>> IMM. When
>>>>>> SC comes back, active amfd will be waiting for that response
>>>>>> forever.
>>>>>>
>>>>>> Patch checks if a susi assignment response is sent but not-ack
>>>>>> just before
>>>>>> both SC come down, amfnd-payload will buffer it in a way as a
>>>>>> susi get
>>>>>> assigned during SC absence
>>>>>> ---
>>>>>> src/amf/amfnd/di.cc | 53
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>>>>> 1 file changed, 45 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>> index e06b9260d..3776a09dc 100644
>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>> @@ -1282,16 +1282,53 @@ void avnd_di_msg_ack_process(AVND_CB *cb,
>>>>>> uint32_t mid) {
>>>>>> Notes : None.
>>>>>> ******************************************************************************/
>>>>>>
>>>>>>
>>>>>> void avnd_diq_del(AVND_CB *cb) {
>>>>>> - AVND_DND_MSG_LIST *rec = 0;
>>>>>> - do {
>>>>>> - /* pop the record */
>>>>>> - m_AVND_DIQ_REC_POP(cb, rec);
>>>>>> - if (!rec) break;
>>>>>> + if ((cb->dnd_list.head != nullptr)) {
>>>>>> + AVND_DND_MSG_LIST *rec = 0;
>>>>>> + bool found = true;
>>>>>> + while (found) {
>>>>>> + found = false;
>>>>>> + for (rec = cb->dnd_list.head; rec != nullptr;
>>>>>> + rec = rec->next) {
>>>>>> + osafassert(rec->msg.type == AVND_MSG_AVD);
>>>>>> + // delete all pending messages that haven't been sent out
>>>>>> + if (rec->no_retries == 0) {
>>>>>> + m_AVND_DIQ_REC_POP(cb, rec);
>>>>>> + avnd_diq_rec_del(cb, rec);
>>>>>> + break;
>>>>>> + } else {
>>>>>> + // Assignment response had been sent, but not ack
>>>>>> because last
>>>>>> + // controller go down, reset msg_id and will be resent
>>>>>> later
>>>>>> + if (rec->msg.info.avd->msg_type ==
>>>>>> AVSV_N2D_INFO_SU_SI_ASSIGN_MSG) {
>>>>>> + if
>>>>>> (rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id != 0) {
>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id = 0;
>>>>>> + found = true;
>>>>>> + LOG_NO(
>>>>>> + "Found not-ack su_si_assign msg for SU:'%s', "
>>>>>> + "SI:'%s', ha_state:'%u', msg_act:'%u',
>>>>>> single_csi:'%u', "
>>>>>> + "error:'%u', msg_id:'%u'",
>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info
>>>>>> + .n2d_su_si_assign.su_name),
>>>>>> + osaf_extended_name_borrow(&rec->msg.info.avd->msg_info
>>>>>> + .n2d_su_si_assign.si_name),
>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.ha_state,
>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_act,
>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign
>>>>>> + .single_csi,
>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.error,
>>>>>> + rec->msg.info.avd->msg_info.n2d_su_si_assign.msg_id);
>>>>>> + }
>>>>>> + } else {
>>>>>> + // delete other messages for now
>>>>>> + m_AVND_DIQ_REC_POP(cb, rec);
>>>>>> + avnd_diq_rec_del(cb, rec);
>>>>>> + break;
>>>>>> + }
>>>>>> + }
>>>>>> - /* delete the record */
>>>>>> - avnd_diq_rec_del(cb, rec);
>>>>>> - } while (1);
>>>>>> + }
>>>>>> + }
>>>>>> + }
>>>>>> return;
>>>>>> }
>>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, Slashdot.org!
>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=Luyb_FCgTEXSpVle_diQMuhKxVmmm6cmv5VA03k0Zu8&e=
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Opensaf-devel mailing list
>>>>> [email protected]
>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Ddevel&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=Lehk1PZKwfDQtYJXNyUKbPAqrw5O--SlPRAF9DIEps4&m=KwsqvdArvOJV5IkAidvFxTT0JBVpgHVYUwJOsjK9dt4&s=e4sg0J1cdg4VnTqeWPDrNZlPv2BuIuFj4Dk7JACxgx8&e=
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel