Hi Anders Widell,

In practice, I could not reproduce the 
`mbcsv_send_ckpt_data_to_all_peers(MDS_SENDTYPE_RED) --> .... --> 
mds_mcm_time_wait() --> osaf_poll_one_fd() --> osaf_poll() --> 
osaf_poll_no_timeout()` hang, but in theory, based on the MDS code, it looks 
possible if MDS fails to start the `await_disc_queue` NCS_TMR.

In normal operation, if AVD tries to send an NCS_MBCSV_SND_USR_ASYNC message 
to a non-existing peer AVD, MBCSV translates it into an `MDS_SENDTYPE_RED` MDS 
message and sends it. If MDS does not find a subscription to that peer, MDS 
adds a subscription entry for it on its own, starts a `discovery_tmr` (with a 
time-out of 5 seconds) and creates a `sel_obj`. That `sel_obj` is added to the 
`await_disc_queue` list of the subscription, and the same `sel_obj` is then 
used in `mds_mcm_time_wait() --> osaf_poll_one_fd() --> osaf_poll() --> 
osaf_poll_no_timeout()`.

If the subscription does not arrive within 500 * 10 ms, the 
subscription_tmr_expiry() function does a SEL_OBJ_IND on disc_queue->sel_obj, 
so that osaf_poll_no_timeout() returns from poll.
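
To make the wake-up mechanism concrete, here is a minimal sketch in plain POSIX C. It is not the MDS code: the `demo_*` names are hypothetical, and a pipe stands in for the selection object. The waiter blocks in poll() on the read end, and the "timer expiry" writes one byte to the write end. If nobody ever writes, a wait with no time-out never returns.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for a selection object: a pipe pair. */
typedef struct {
  int rfd; /* read end, polled by the waiting sender */
  int wfd; /* write end, written by the timer expiry */
} demo_sel_obj;

/* Create the selection object. Returns 0 on success, -1 on failure. */
int demo_sel_obj_create(demo_sel_obj *o) {
  int fds[2];
  assert(o != NULL);
  if (pipe(fds) != 0) return -1;
  o->rfd = fds[0];
  o->wfd = fds[1];
  return 0;
}

/* What the timer expiry would do: indicate on the selection object. */
int demo_sel_obj_ind(const demo_sel_obj *o) {
  char c = 1;
  return write(o->wfd, &c, 1) == 1 ? 0 : -1;
}

/* Wait for an indication. timeout_ms == -1 blocks indefinitely, which is
 * the situation seen in the back-trace. Returns 1 when woken up, 0 on
 * time-out and -1 on error. */
int demo_wait(const demo_sel_obj *o, int timeout_ms) {
  struct pollfd p = { .fd = o->rfd, .events = POLLIN, .revents = 0 };
  int rc;
  do {
    rc = poll(&p, 1, timeout_ms);
  } while (rc < 0 && errno == EINTR); /* retry if interrupted by a signal */
  return rc;
}
```

With no indication, `demo_wait(&o, 50)` returns 0 after 50 ms, while `demo_wait(&o, -1)` would block until `demo_sel_obj_ind()` is called.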

But in your case osaf_poll_no_timeout() does not return from poll within 
10 seconds, which exceeds the discovery_tmr expiration time.

So the only theoretically possible code bug in MDS is in 
`mds_mcm_subtn_add() --> mds_subtn_tbl_add() --> m_NCS_TMR_START(subtn_info->discovery_tmr)`: 
the error handling is not proper. Even if the discovery_tmr is not started, 
the code assumes that it started successfully. As a result, SEL_OBJ_IND will 
never be done on disc_queue->sel_obj, and AVD can wait forever in 
osaf_poll_no_timeout().
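
A rough sketch of the kind of error handling I have in mind, again with hypothetical `demo_*` names rather than the real MDS functions and macros: the add function checks whether the timer really started and, if not, rolls back and reports failure, so the caller never goes on to wait on a `sel_obj` that nobody will signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, much simplified subscription entry. */
typedef struct {
  bool timer_running;
} demo_subtn;

/* Hypothetical timer start. The inject_failure flag exists only so that
 * the failure path can be exercised in this sketch. */
static bool demo_timer_start(demo_subtn *s, bool inject_failure) {
  assert(s != NULL);
  if (inject_failure) return false;
  s->timer_running = true;
  return true;
}

/* Add a subscription entry. Returns the entry on success, or NULL when
 * the entry could not be allocated or its timer could not be started.
 * The caller must not wait for discovery when NULL is returned. */
demo_subtn *demo_subtn_add(bool inject_failure) {
  demo_subtn *s = calloc(1, sizeof *s);
  if (s == NULL) return NULL;
  if (!demo_timer_start(s, inject_failure)) {
    free(s); /* roll back instead of assuming the timer is running */
    return NULL;
  }
  return s;
}
```

On the failure path the send could then return an error to the caller instead of blocking.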

So for this ticket, I can improve the error handling of the 
mds_mcm_subtn_add() function and provide a patch. Do you have any suggestions?

-AVM



---

**[tickets:#2278] mds: Blocking send causes AMF health check time-out**

**Status:** assigned
**Milestone:** 5.1.1
**Created:** Thu Jan 26, 2017 09:49 AM UTC by Anders Widell
**Last Updated:** Thu Feb 09, 2017 08:45 AM UTC
**Owner:** A V Mahesh (AVM)


AMF health-check time-out is seen on SC-1 after restarting SC-2. The system is 
using OpenSAF 5.1.0 configured with TCP communication.

Syslog:

~~~
2017-01-20T18:29:04.405982+01:00 local0.err SC-1 osafamfnd[2820]: ER AMF director heart beat timeout, generating core for amfd
2017-01-20T18:29:05.408819+01:00 local0.crit SC-1 osafamfnd[2820]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131343, SupervisionTime = 0
~~~

Back-trace of osafamfd:

~~~
0x7fa316cceb60 osaf_poll_no_timeout (osaf/libs/core/common/osaf_poll.c:33)
0x7fa316ccede5 osaf_poll (osaf/libs/core/common/osaf_poll.c:45)
0x7fa316ccee25 osaf_poll_one_fd (osaf/libs/core/common/osaf_poll.c:129)
0x7fa316cfab67 mds_mcm_time_wait (osaf/libs/core/common/include/osaf_utility.h:79)
0x7fa316cfae51 mds_subtn_tbl_add_disc_queue (osaf/libs/core/mds/mds_c_sndrcv.c:1808)
0x7fa316cfb03d mds_mcm_process_disc_queue_checks_redundant (osaf/libs/core/mds/mds_c_sndrcv.c:2338)
0x7fa316cfbcd1 mcm_pvt_red_snd_process_common (osaf/libs/core/mds/mds_c_sndrcv.c:2257)
0x7fa316cfd04d mcm_pvt_red_svc_snd (osaf/libs/core/mds/mds_c_sndrcv.c:2174)
0x7fa316cff8f9 mds_send (osaf/libs/core/mds/mds_c_sndrcv.c:736)
0x7fa316cf9068 ncsmds_api (osaf/libs/core/mds/mds_papi.c:191)
0x7fa316ce6f5f mbcsv_mds_send_msg (osaf/libs/core/mbcsv/mbcsv_mds.c:239)
0x7fa316cec440 mbcsv_send_ckpt_data_to_all_peers (osaf/libs/core/mbcsv/mbcsv_util.c:479)
0x7fa316ce56d7 mbcsv_process_snd_ckpt_request (osaf/libs/core/mbcsv/mbcsv_api.c:862)
0x40bfc0 avsv_send_ckpt_data(cl_cb_tag*, unsigned int, unsigned long, unsigned int, unsigned int) (osaf/services/saf/amf/amfd/chkop.cc:1062)
0x446649 avd_node_oper_state_set(AVD_AVND*, SaAmfOperationalStateT) (osaf/services/saf/amf/amfd/node.cc:505)
0x44040c avd_node_mark_absent(AVD_AVND*) (osaf/services/saf/amf/amfd/ndfsm.cc:1018)
0x4438ba avd_node_failover(AVD_AVND*) (osaf/services/saf/amf/amfd/ndproc.cc:1141)
~~~

