- **Comment**:

Follow up by Hans Nordebäck :
/=================/=================/=================

On 3/15/2017 1:55 PM, A V Mahesh wrote:
> Hi Hans N,
>
> Try add some debugging logs in LEAP as well and try to reproduce.
>
> My guess is the specific system my running out of FD,
> or the system libraries are not comestible to our LEAP system apis calls.
>
> -AVM
>
> On 3/15/2017 1:48 PM, Hans Nordebäck wrote:
>> Hi Mahesh,
>>
>> ok, I understand the logic now, I'll check your patch but it only checks 
>> NCS_TMR_START? Perhaps we should trace
>> the complete flow to conclude why the descriptor is not signaled, e.g. 
>> mds_tmr_mailbox_processing and so on?
>> /Thanks HansN
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>> Sent: den 15 mars 2017 08:04
>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>
>> Cc: Anders Widell <anders.wid...@ericsson.com>; Nagendra Kumar 
>> <nagendr...@oracle.com>
>> Subject: Re: MDS question #2278
>>
>> Hi Hans N
>>
>>   >> mbcsv_mds_send_msg is 0 and therfore set to -1 when calling poll, 
>> normally it should be 10 seconds or 1 second.
>>
>> No .
>>
>> It is right,  mbcsv_mds_send_msg() --> mds_mcm_time_wait() --> 
>> osaf_poll_one_fd () --> osaf_poll --> osaf_poll_no_timeout( -1 ), if MDS 
>> didn't find the subscription to that peer, MDS manually adds a subscription 
>> entry for the same and starts a discovery_tmr (with time out of 5 sec and 
>> creates sel_obj the same sel_obj added to await_disc_queue list of that 
>> subscription , the and the same sel_obj used for mds_mcm_time_wait() --> 
>> osaf_poll_one_fd () --> osaf_poll --> osaf_poll_no_timeout() If the 
>> subscription doesn't arrive with in 500 * 10 ms, then the
>> subscription_tmr_expiry() function do SEL_OBJ_IND on disc_queue->sel_obj, so 
>> that osaf_poll_no_timeout() will come out of poll.
>> So the only possible theoretical code bug in MDS is
>> mds_mcm_subtn_add()-->mds_subtn_tbl_add()-->m_NCS_TMR_START(subtn_info->discovery_tmr)
>> .
>>
>> Apply my debug patch and see  m_NCS_TMR_START errors.
>>
>> -AVM
>>
>> ||
>>
>>
>> On 3/15/2017 12:04 PM, Hans Nordebäck wrote:
>>> Hi Mahesh,
>>>
>>> Thanks, I'll read the private mail, but with the debug output below,
>>> it seems quite clear why the 'amfd heart beat timeout' happens, the timeout 
>>> value to mbcsv_mds_send_msg is 0 and therfore set to -1 when calling poll, 
>>> normally it should be 10 seconds or 1 second.
>>>
>>> We managed to get a call chain at the time of the fault:
>>>
>>> vas-1:/root # /gstack 2578
>>> Thread 4 (Thread 0x7f5173128b00 (LWP 2584)):
>>> 0  0x00007f5170edbbfd in poll () from /lib64/libc.so.6
>>> 1  0x00007f5172d45261 in osaf_ppoll () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 2  0x00007f5172d453fb in osaf_poll () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 3  0x00007f5172d45445 in osaf_poll_one_fd () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 4  0x00007f5172d760d7 in rda_read_msg(int, char*, int) [clone
>>> .constprop.2] () from /usr/lib64/libopensaf_core.so.0
>>> 5  0x00007f5172d76364 in rda_callback_task(RDA_CALLBACK_CB*) () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 6  0x00007f51720600a4 in start_thread () from /lib64/libpthread.so.0
>>> 7  0x00007f5170ee402d in clone () from /lib64/libc.so.6 Thread 3
>>> (Thread 0x7f5173148b00 (LWP 2583)):
>>> 0  0x00007f5170edbbfd in poll () from /lib64/libc.so.6
>>> 1  0x00007f5172d6d920 in mdtm_process_recv_events_tcp () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 2  0x00007f51720600a4 in start_thread () from /lib64/libpthread.so.0
>>> 3  0x00007f5170ee402d in clone () from /lib64/libc.so.6 Thread 2
>>> (Thread 0x7f5173188b00 (LWP 2581)):
>>> 0  0x00007f5170edbbfd in poll () from /lib64/libc.so.6
>>> 1  0x00007f5172d45261 in osaf_ppoll () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 2  0x00007f5172d4c3df in ncs_tmr_wait () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 3  0x00007f51720600a4 in start_thread () from /lib64/libpthread.so.0
>>> 4  0x00007f5170ee402d in clone () from /lib64/libc.so.6 Thread 1
>>> (Thread 0x7f51731ab740 (LWP 2578)):
>>> 0  0x00007f5170edbbfd in poll () from /lib64/libc.so.6
>>> 1  0x00007f5172d45261 in osaf_ppoll () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 2  0x00007f5172d453fb in osaf_poll () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 3  0x00007f5172d45445 in osaf_poll_one_fd () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 4  0x00007f5172d62457 in mds_mcm_time_wait () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 5  0x00007f5172d626bc in mds_subtn_tbl_add_disc_queue.isra () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 6  0x00007f5172d628c3 in mds_mcm_process_disc_queue_checks_redundant
>>> () from /usr/lib64/libopensaf_core.so.0
>>>7  0x00007f5172d632e5 in mcm_pvt_red_snd_process_common () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 8  0x00007f5172d663ad in mds_send () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 9  0x00007f5172d708b0 in ncsmds_api () from
>>> /usr/lib64/libopensaf_core.so.0
>>>10 0x00007f5172d5187f in mbcsv_mds_send_msg () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 11 0x00007f5172d56d60 in mbcsv_send_ckpt_data_to_all_peers () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 12 0x00007f5172d4fff7 in mbcsv_process_snd_ckpt_request () from
>>> /usr/lib64/libopensaf_core.so.0
>>> 13 0x000000000040bfc0 in avsv_send_ckpt_data(cl_cb_tag*, unsigned
>>> int, unsigned long, unsigned int, unsigned int) ()
>>> 14 0x000000000041dfc7 in
>>> AVD_COMP::avd_comp_oper_state_set(SaAmfOperationalStateT) ()
>>> 15 0x0000000000494f8a in AVD_SU::disable_comps(SaAisErrorT) ()
>>>16 0x00000000004804e9 in avd_node_down_appl_susi_failover(cl_cb_tag*,
>>> AVD_AVND*) ()
>>> 17 0x00000000004437d5 in avd_node_failover(AVD_AVND*) ()
>>> 18 0x00000000004418c0 in avd_mds_avnd_down_evh(cl_cb_tag*, AVD_EVT*)
>>> ()
>>> 19 0x000000000043df14 in process_event(cl_cb_tag*, AVD_EVT*) ()
>>> 20 0x0000000000407a86 in main ()
>>>
>>> /Regards HansN
>>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>> Sent: den 15 mars 2017 04:35
>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell
>>> <anders.wid...@ericsson.com>
>>> Subject: Re: MDS question
>>>
>>> Hi Hans N,
>>>
>>> I did explained the MDS functionality and my debug observation in
>>> ###https://sourceforge.net/p/opensaf/tickets/2278/,
>>> so  please do reproduce the problem issue with  debug patch attached in 
>>> Ticket  and share the logs.
>>>
>>> I will also FW the private mail  between Anders and us  FYI.
>>>
>>> -AVM
>>>
>>> On 3/14/2017 4:52 PM, Hans Nordebäck wrote:
>>>> Hi Mahesh,
>>>>
>>>> AndersW mentioned that this problem seems related to ticket #2278,
>>>> and I think it is.
>>>>
>>>> I have log files and amfd trace files if needed.
>>>>
>>>> /Thanks Hans
>>>>
>>>>
>>>> On 03/14/2017 11:19 AM, Hans Nordebäck wrote:
>>>>> Hi Mahesh,
>>>>>
>>>>> we have a problem with a system when it e.g. scales down to one
>>>>> node, there is a problem at Amf mbcsv to all
>>>>>
>>>>> standby peers,
>>>>>
>>>>>    :
>>>>>
>>>>> Mar 14  9:49:19.022456 osafamfd [1133:mbcsv_util.c:0441] TR
>>>>> dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE Mar 14
>>>>> 9:49:19.022466 osafamfd [1133:mbcsv_act.c:0101] TR ASYNC update to
>>>>> be sent. role: 1,
>>>>> svc_id: 10, pwe_hdl: 65537 Mar 14  9:49:19.022477 osafamfd
>>>>> [1133:mbcsv_mds.c:0185] >>
>>>>> mbcsv_mds_send_msg: sending to vdest:1 Mar 14  9:49:19.022487
>>>>> osafamfd [1133:mbcsv_mds.c:0201] TR send type MDS_SENDTYPE_RED Mar
>>>>> 14
>>>>> 9:49:19.022526 osafamfd [1133:mds_c_sndrcv.c:2569] TR DEBUG:
>>>>> mds_mcm_time_wait 0
>>>>> Mar 14  9:49:19.025182 osafamfd [1133:mds_dt_trans.c:0608] >>
>>>>> mdtm_process_poll_recv_data_tcp
>>>>>
>>>>> I added a new trace in mds_c_sndrcv.c that shows that wait time is
>>>>> zero, normally it should be 10 seconds or 1 second.
>>>>>
>>>>> The 0 is translated to -1 causing the following poll to wait forever
>>>>> as there are no standby nodes to respond and Amfnd heart beat
>>>>>
>>>>> time out happens.
>>>>>
>>>>> Not sure but I think the problem may be in this MDS function:
>>>>>
>>>>> static uint32_t
>>>>> mds_mcm_process_disc_queue_checks_redundant(MDS_SVC_INFO *svc_cb,
>>>>> MDS_SVC_ID dest_svc_id,
>>>>>                                MDS_VDEST_ID dest_vdest_id, V_DEST_QA
>>>>> anchor,
>>>>>                                MDS_SEND_INFO *req) {
>>>>>
>>>>>       MDS_SUBSCRIPTION_INFO *sub_info = NULL;
>>>>>       MDS_SUBSCRIPTION_RESULTS_INFO *log_subtn_result_info = NULL;
>>>>>       uint32_t disc_rc;
>>>>>       bool time_wait = false;
>>>>>
>>>>>       MDS_HDL env_hdl;
>>>>>
>>>>>       env_hdl =
>>>>> (MDS_HDL)(m_MDS_GET_PWE_HDL_FROM_SVC_HDL(svc_cb->svc_hdl));
>>>>>       mds_subtn_tbl_get(svc_cb->svc_hdl, dest_svc_id, &sub_info);
>>>>>
>>>>>       if (sub_info == NULL) {
>>>>>           /* No subscription to this */
>>>>>           /* Make a subscription to this service */
>>>>>           m_MDS_LOG_INFO("MDS_SND_RCV: No subscription to svc_id =
>>>>> %s(%d), Making subscription\n", get_svc_names(dest_svc_id),
>>>>> dest_svc_id);
>>>>>           mds_mcm_subtn_add(svc_cb->svc_hdl, dest_svc_id,
>>>>> svc_cb->install_scope,
>>>>>                     MDS_VIEW_RED /* redundantview */ ,
>>>>> MDS_SUBTN_IMPLICIT);
>>>>>
>>>>>           if (NCSCC_RC_SUCCESS != mds_subtn_tbl_get(svc_cb->svc_hdl,
>>>>> dest_svc_id, &sub_info)) {
>>>>>               m_MDS_LOG_INFO("MDS_SND_RCV: Subscription made but no
>>>>> pointer available\n");
>>>>>               return NCSCC_RC_FAILURE;
>>>>>           }
>>>>>       } else if (sub_info->tmr_flag != true) {
>>>>>           if ((MDS_SENDTYPE_RSP == req->i_sendtype) ||
>>>>> (MDS_SENDTYPE_RRSP == req->i_sendtype)) {
>>>>>               time_wait = true;
>>>>>               m_MDS_LOG_INFO("MDS_SND_RCV:Disc queue red: Subscr
>>>>> exists no timer running: Waiting for some time\n");
>>>>>           } else {
>>>>>               m_MDS_LOG_INFO("MDS_SND_RCV: Subscription exists but
>>>>> Timer has expired\n");
>>>>>               return NCSCC_RC_FAILURE;
>>>>>           }
>>>>>       }
>>>>>
>>>>>      :
>>>>>
>>>>> if the call to mds_subtn_tbl_get sets sub_info to NULL, time_wait
>>>>> will remain false, and then calling mds_subtn_tbl_add_disc_queue
>>>>>
>>>>> with false causes infinite wait in the following poll. Can this if
>>>>> stmt be moved so it is not depended on sub_info?
>>>>>
>>>>> if ((MDS_SENDTYPE_RSP == req->i_sendtype) || (MDS_SENDTYPE_RRSP ==
>>>>> req->i_sendtype)) {
>>>>>
>>>>>
>>>>> /Regards Hans
>>>>>
/=================/=================/=================






---

** [tickets:#2278] mds: Blocking send causes AMF health check time-out**

**Status:** assigned
**Milestone:** 5.1.1
**Created:** Thu Jan 26, 2017 09:49 AM UTC by Anders Widell
**Last Updated:** Mon Feb 13, 2017 08:54 AM UTC
**Owner:** A V Mahesh (AVM)


AMF health-check time-out is seen on SC-1 after restarting SC-2. The system is 
using OpenSAF 5.1.0 configured with TCP communication.

Syslog:

~~~
2017-01-20T18:29:04.405982+01:00 local0.err SC-1 osafamfnd[2820]: ER AMF 
director heart beat timeout, generating core for amfd
2017-01-20T18:29:05.408819+01:00 local0.crit SC-1 osafamfnd[2820]: Rebooting 
OpenSAF NodeId = 131343 EE Name = , Reason: AMF director heart beat timeout, 
OwnNodeId = 131343, SupervisionTime = 0
~~~

Back-trace of osafamfd:

~~~
0x7fa316cceb60 osaf_poll_no_timeout (osaf/libs/core/common/osaf_poll.c:33)
0x7fa316ccede5 osaf_poll (osaf/libs/core/common/osaf_poll.c:45)
0x7fa316ccee25 osaf_poll_one_fd (osaf/libs/core/common/osaf_poll.c:129)
0x7fa316cfab67 mds_mcm_time_wait 
(osaf/libs/core/common/include/osaf_utility.h:79)
0x7fa316cfae51 mds_subtn_tbl_add_disc_queue 
(osaf/libs/core/mds/mds_c_sndrcv.c:1808)
0x7fa316cfb03d mds_mcm_process_disc_queue_checks_redundant 
(osaf/libs/core/mds/mds_c_sndrcv.c:2338)
0x7fa316cfbcd1 mcm_pvt_red_snd_process_common 
(osaf/libs/core/mds/mds_c_sndrcv.c:2257)
0x7fa316cfd04d mcm_pvt_red_svc_snd (osaf/libs/core/mds/mds_c_sndrcv.c:2174)
0x7fa316cff8f9 mds_send (osaf/libs/core/mds/mds_c_sndrcv.c:736)
0x7fa316cf9068 ncsmds_api (osaf/libs/core/mds/mds_papi.c:191)
0x7fa316ce6f5f mbcsv_mds_send_msg (osaf/libs/core/mbcsv/mbcsv_mds.c:239)
0x7fa316cec440 mbcsv_send_ckpt_data_to_all_peers 
(osaf/libs/core/mbcsv/mbcsv_util.c:479)
0x7fa316ce56d7 mbcsv_process_snd_ckpt_request 
(osaf/libs/core/mbcsv/mbcsv_api.c:862)
0x40bfc0 avsv_send_ckpt_data(cl_cb_tag*, unsigned int, unsigned long, unsigned 
int, unsigned int) (osaf/services/saf/amf/amfd/chkop.cc:1062)
0x446649 avd_node_oper_state_set(AVD_AVND*, SaAmfOperationalStateT) 
(osaf/services/saf/amf/amfd/node.cc:505)
0x44040c avd_node_mark_absent(AVD_AVND*) 
(osaf/services/saf/amf/amfd/ndfsm.cc:1018)
0x4438ba avd_node_failover(AVD_AVND*) 
(osaf/services/saf/amf/amfd/ndproc.cc:1141)

~~~


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

Reply via email to