Hi Anders,

I tried a quick reproduction but was not successful. It looks to me like an issue related to the boot order of the network interfaces and OpenSAF start-up.
I will keep trying to reproduce the problem and will update you.
================================================================================
On 8/16/2017 2:02 PM, Anders Widell wrote:
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.104, node_id: 0
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: accept() failed
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: NO Lost contact with 'PL-4'
> 2017-08-15 10:41:21.319 SC-1 kernel:[6471411.217909] IPv6: ADDRCONF(NETDEV_UP): veth1BBPKW: link is not ready
================================================================================

-AVM

On 8/16/2017 2:02 PM, Anders Widell wrote:
Here are the logs from SC-1. It seems to be starting at the same time as PL-4:

2017-08-15 10:41:20.262 SC-1 opensafd: Starting OpenSAF Services(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) (Using TCP)
2017-08-15 10:41:20.263 SC-1 opensafd[113]: /etc/init.d/opensafd: 113: /etc/init.d/opensafd: cannot create /proc/sys/kernel/core_pattern: Read-only file system
2017-08-15 10:41:20.289 SC-1 osafdtmd[156]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
2017-08-15 10:41:20.291 SC-1 osafdtmd[156]: Started
2017-08-15 10:41:20.296 SC-1 osaftransportd[166]: Started
2017-08-15 10:41:20.307 SC-1 osafclmna[170]: Started
2017-08-15 10:41:20.320 SC-1 osafrded[179]: Started
2017-08-15 10:41:20.337 SC-1 osaffmd[188]: Started
2017-08-15 10:41:20.338 SC-1 osaffmd[188]: NO Remote fencing is disabled
2017-08-15 10:41:20.354 SC-1 osafimmd[198]: Started
2017-08-15 10:41:20.370 SC-1 osafimmnd[209]: Started
2017-08-15 10:41:20.371 SC-1 osafimmnd[209]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
2017-08-15 10:41:20.396 SC-1 opensafd[145]: NO Monitoring of TRANSPORT started
2017-08-15 10:41:20.507 SC-1 osafclmna[170]: NO Starting to promote this node to a system controller
2017-08-15 10:41:20.507 SC-1 osafrded[179]: NO Requesting ACTIVE role
2017-08-15 10:41:20.507 SC-1 osafrded[179]: NO RDE role set to Undefined
2017-08-15 10:41:20.542 SC-1 osafdtmd[156]: NO Established contact with 'PL-4'
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.104, node_id: 0
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: accept() failed
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: NO Lost contact with 'PL-4'
2017-08-15 10:41:21.319 SC-1 kernel:[6471411.217909] IPv6: ADDRCONF(NETDEV_UP): veth1BBPKW: link is not ready
2017-08-15 10:41:22.509 SC-1 osafrded[179]: NO Running '/usr/local/lib/opensaf/opensaf_sc_active' with 0 argument(s)
2017-08-15 10:41:22.534 SC-1 osafrded[179]: NO Switched to ACTIVE from Undefined
2017-08-15 10:41:22.534 SC-1 osaffmd[188]: NO Starting activation supervision: 300000ms
2017-08-15 10:41:22.535 SC-1 osafimmnd[209]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.535 SC-1 osafrded[179]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.535 SC-1 osafclmna[170]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.536 SC-1 osaftransportd[166]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: osafdtmd Process down, Rebooting the node, OwnNodeId = 0, SupervisionTime = 60
2017-08-15 10:41:22.546 SC-1 opensaf_reboot: Rebooting local node; timeout=60

On 08/16/2017 10:22 AM, Anders Widell wrote:

It is quite easy to reproduce the "dtm_node_add failed" log message, which leads to a reboot. I have only seen the core dump once, though. Here is the log from the time when I got the core dump:

2017-08-15 10:41:15.218 PL-4 systemd[1]: Started LSB: ping controllers.
2017-08-15 10:41:15.219 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/opensafd.service: Operation not permitted
2017-08-15 10:41:15.219 PL-4 systemd[1]: Starting OpenSAF daemon...
2017-08-15 10:41:15.219 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/plsync.service: Operation not permitted
2017-08-15 10:41:15.253 PL-4 opensafd: Starting OpenSAF Services(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) (Using TCP)
2017-08-15 10:41:15.254 PL-4 opensafd[134]: /etc/init.d/opensafd: 113: /etc/init.d/opensafd: cannot create /proc/sys/kernel/core_pattern: Read-only file system
2017-08-15 10:41:15.279 PL-4 osafdtmd[169]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
2017-08-15 10:41:15.280 PL-4 osafdtmd[169]: Started
2017-08-15 10:41:15.284 PL-4 osaftransportd[177]: Started
2017-08-15 10:41:15.297 PL-4 osafclmna[183]: Started
2017-08-15 10:41:15.312 PL-4 osafimmnd[193]: Started
2017-08-15 10:41:15.314 PL-4 osafimmnd[193]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
2017-08-15 10:41:15.383 PL-4 opensafd[158]: NO Monitoring of TRANSPORT started
2017-08-15 10:41:15.384 PL-4 opensafd[158]: NO Monitoring of CLMNA started
2017-08-15 10:41:15.498 PL-4 osafclmna[183]: NO Starting to promote this node to a system controller
2017-08-15 10:41:15.532 PL-4 osafdtmd[169]: NO Established contact with 'PL-3'
2017-08-15 10:41:15.533 PL-4 osafdtmd[169]: NO Established contact with 'SC-2'
2017-08-15 10:41:15.534 PL-4 osafdtmd[169]: NO Established contact with 'SC-1'
2017-08-15 10:41:15.535 PL-4 osafimmnd[193]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
2017-08-15 10:41:15.535 PL-4 osafclmna[183]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
2017-08-15 10:41:15.537 PL-4 osafimmnd[193]: NO Fevs count adjusted to 2356 preLoadPid: 0
2017-08-15 10:41:15.616 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2017-08-15 10:41:15.718 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2017-08-15 10:41:15.819 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2017-08-15 10:41:15.819 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_ISOLATED
2017-08-15 10:41:16.252 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
2017-08-15 10:41:16.323 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
2017-08-15 10:41:16.593 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2714
2017-08-15 10:41:16.593 PL-4 osafimmnd[193]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE
2017-08-15 10:41:16.594 PL-4 osafimmnd[193]: WA IMM Access Control mode is DISABLED!
2017-08-15 10:41:16.594 PL-4 osafimmnd[193]: NO Epoch set to 6 in ImmModel
2017-08-15 10:41:16.656 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM_SERVER_READY
2017-08-15 10:41:16.656 PL-4 opensafd[158]: NO Monitoring of IMMND started
2017-08-15 10:41:16.656 PL-4 osafimmnd[193]: NO ImmModel received scAbsenceAllowed 0
2017-08-15 10:41:16.672 PL-4 osafamfnd[204]: Started
2017-08-15 10:41:16.677 PL-4 osafamfnd[204]: NO Sending node up due to NCSMDS_UP
2017-08-15 10:41:16.682 PL-4 osafamfnd[204]: NO saAmfCtDefQuiescingCompleteTimeout for 'safVersion=5.5.5,safCompType=AmfDemo2' initialized with saAmfCtDefCallbackTimeout
2017-08-15 10:41:16.698 PL-4 osafamfnd[204]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State UNINSTANTIATED => INSTANTIATING
2017-08-15 10:41:16.734 PL-4 osafamfwd[215]: Started
2017-08-15 10:41:16.776 PL-4 osafckptnd[225]: Started
2017-08-15 10:41:16.847 PL-4 osafsmfnd[249]: Started
2017-08-15 10:41:16.885 PL-4 osafamfnd[204]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING => INSTANTIATED
2017-08-15 10:41:16.894 PL-4 osafamfnd[204]: NO Assigning 'safSi=NoRed4,safApp=OpenSAF' ACTIVE to 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
2017-08-15 10:41:16.894 PL-4 osafamfnd[204]: NO Assigned 'safSi=NoRed4,safApp=OpenSAF' ACTIVE to 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
2017-08-15 10:41:16.908 PL-4 opensafd: OpenSAF(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) services successfully started
2017-08-15 10:41:16.912 PL-4 opensafd[134]: Starting OpenSAF Services (Using TCP): *
2017-08-15 10:41:16.914 PL-4 systemd[1]: Started OpenSAF daemon.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Reached target Multi-User System.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Reached target Graphical Interface.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/systemd-update-utmp-runlevel.service: Operation not permitted
2017-08-15 10:41:17.011 PL-4 systemd[1]: Starting Update UTMP about System Runlevel Changes...
2017-08-15 10:41:17.017 PL-4 systemd[1]: Started Update UTMP about System Runlevel Changes.
2017-08-15 10:41:17.017 PL-4 systemd[1]: Startup finished in 6.981s.
2017-08-15 10:41:17.959 PL-4 osafimmnd[193]: NO Implementer disconnected 23 <0, 2010f> (@OpenSafImmReplicatorB)
2017-08-15 10:41:17.964 PL-4 osafimmnd[193]: NO Global discard node received for nodeId:2010f pid:209
2017-08-15 10:41:17.964 PL-4 osafimmnd[193]: NO Implementer disconnected 24 <0, 2010f(down)> (@safAmfService2010f)
2017-08-15 10:41:19.208 PL-4 osafdtmd[169]: NO Lost contact with 'SC-1'
2017-08-15 10:41:20.542 PL-4 osafdtmd[169]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.101, node_id: 131343
2017-08-15 10:41:20.548 PL-4 osafckptnd[225]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.549 PL-4 osafsmfnd[249]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.550 PL-4 osafamfnd[204]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.551 PL-4 osafamfwd[215]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.552 PL-4 osafimmnd[193]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.552 PL-4 osafclmna[183]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.553 PL-4 osaftransportd[177]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: osafdtmd Process down, Rebooting the node, OwnNodeId = 0, SupervisionTime = 60
2017-08-15 10:41:20.570 PL-4 opensaf_reboot: Rebooting local node; timeout=60

On 08/16/2017 10:11 AM, A V Mahesh wrote:

Hi Anders, can you share the logs?

-AVM

On 8/16/2017 1:33 PM, Anders Widell wrote:

The problem seems to happen with a low probability when I reboot a node.
Presumably, it can happen when two nodes try to establish TCP connections towards each other at approximately the same time.

regards,
Anders Widell

On 08/16/2017 06:21 AM, A V Mahesh wrote:

Hi Anders,

On 8/15/2017 6:13 PM, Anders Widell wrote:

Is it a pure performance optimization, or is there some potential problem that can happen if two nodes discover each other at the same time and establish TCP connections in both directions? I am seeing error log messages that dtm_node_add() has failed.

It is both. In case all nodes join the cluster simultaneously, it is designed so that the TCP connection is established based on the higher node ID (we are using a single connection for both tx & rx); otherwise it would be a race. In which case are you getting the error?

-AVM

On 8/15/2017 6:13 PM, Anders Widell wrote:

A question to Mahesh: what is the reason behind the following few lines of code:

    if (initial_discovery_phase == true) {
        if (node.node_id < dtms_cb->node_id) {
            TRACE("DTM: received node_id is less than local node_id dropping message cluster_id: %d node_id: %u", node.cluster_id, node.node_id);
            return sock_desc;
        }
    }

Is it a pure performance optimization, or is there some potential problem that can happen if two nodes discover each other at the same time and establish TCP connections in both directions? I am seeing error log messages that dtm_node_add() has failed.

regards,
Anders Widell

On 08/15/2017 11:09 AM, Anders Widell wrote:

Yes, I agree. I think it is more flexible to have a new configuration option for this. Some comments on the patch:

1) The new configuration option should specify the broadcast interval in milliseconds rather than seconds, to make it possible to configure sub-second intervals.

2) I prefer to comment out the new option in the config file. A commented-out option means that the user wishes to use the default value from the C code - which can potentially change after upgrading to a new version of OpenSAF.
If the user wishes to explicitly set the configuration option, then he can uncomment the line.

3) When testing this patch I got this rather unsettling core dump, caused by "corrupted double-linked list" in malloc(). I will take a closer look at this to see if I can understand what is happening. Could you also take a look? Maybe run osafdtmd with valgrind:

Stack trace of core.1502786480.osafdtmd.169.PL-4
[New LWP 173]
[New LWP 171]
[New LWP 169]
[New LWP 174]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/lib/opensaf/osafdtmd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe9ad080428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
[Current thread is 1 (Thread 0x7fe9adaacb00 (LWP 173))]

Thread 4 (Thread 0x7fe9ada8cb00 (LWP 174)):
#0  0x00007fe9ad145b5d in poll () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x000055686a3b53ba in poll (__timeout=20000, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2  dtm_intranode_processing () at src/dtm/dtmnd/dtm_intra.c:713
        poll_ret_val = 0
        j = <optimized out>
        __FUNCTION__ = "dtm_intranode_processing"
#3  0x00007fe9ad41b6ba in start_thread (arg=0x7fe9ada8cb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9ada8cb00
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617627904, -4592431484730074264, 1, 140721305493231, 140641617628608, 0, 4582197893944318824, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#4  0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
Thread 3 (Thread 0x7fe9adacf740 (LWP 169)):
#0  __clock_nanosleep (clock_id=clock_id@entry=1, flags=flags@entry=1, req=req@entry=0x7ffc3b6d2130, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
        oldstate = 0
        r = <optimized out>
        rem = 0x0
        req = 0x7ffc3b6d2130
        flags = 1
        clock_id = <optimized out>
#1  0x00007fe9ad65e034 in osaf_nanosleep (sleep_duration=sleep_duration@entry=0x7ffc3b6d2170) at src/base/osaf_time.c:44
        wakeup_time = {tv_sec = 6471440, tv_nsec = 210765046}
        retval = <optimized out>
#2  0x00007fe9ad65af67 in ncs_os_task (task=task@entry=0x7ffc3b6d2400, request=request@entry=NCS_OS_TASK_SLEEP) at src/base/os_defs.c:333
        ts = {tv_sec = 30, tv_nsec = 0}
        __PRETTY_FUNCTION__ = "ncs_os_task"
#3  0x00007fe9ad665143 in ncs_task_sleep (delay_in_ms=<optimized out>) at src/base/sysf_tsk.c:141
        task = {info = {create = {i_entry_point = 0x7530, i_name = 0x0, i_priority = 0, i_stack_nbytes = 0, i_policy = 0, i_ep_arg = 0x0, o_handle = 0x0}, release = {i_handle = 0x7530}, detach = {i_handle = 0x7530}, start = {i_handle = 0x7530}, stop = {i_handle = 0x7530}, sleep = {i_delay_in_ms = 30000}, current_handle = {o_handle = 0x7530}}}
#4  0x000055686a3b3c65 in main (argc=<optimized out>, argv=0x7ffc3b6d2668) at src/dtm/dtmnd/dtm_main.c:396
        rc = <optimized out>
        send_bcast_buffer = "\000:\000\001\000\002\004\017\000\032,\002\061\060.0.3.104", '\000' <repeats 232 times>
        bcast_buf_len = 58
        dis_time_out_usec = 5000000
        dis_elapsed_time_usec = 5250000
        dtms_cb = 0x55686a5cb0a0 <_dtms_cb>
        __FUNCTION__ = "main"

Thread 2 (Thread 0x7fe9adaccb00 (LWP 171)):
#0  0x00007fe9ad145b5d in poll () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x00007fe9ad65cda0 in poll (__timeout=-1, __nfds=1, __fds=0x7fe9adacc260) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2  osaf_poll_no_timeout (io_fds=0x7fe9adacc260, i_nfds=1) at src/base/osaf_poll.c:31
No locals.
#3  0x00007fe9ad65cfe5 in osaf_ppoll (io_fds=io_fds@entry=0x7fe9adacc260, i_nfds=i_nfds@entry=1, i_timeout_ts=0x0,
        i_sigmask=i_sigmask@entry=0x0) at src/base/osaf_poll.c:82
        start_time = {tv_sec = 0, tv_nsec = 1}
        time_left_ts = <optimized out>
        result = -1381186976
#4  0x00007fe9ad66416f in ncs_tmr_wait () at src/base/sysf_tmr.c:463
        rc = <optimized out>
        inds_rmvd = <optimized out>
        next_delay = <optimized out>
        tv = <optimized out>
        ts_current = {tv_sec = 6471404, tv_nsec = 949388789}
        ts = {tv_sec = 16777215, tv_nsec = 0}
        set = {fd = 6, events = 1, revents = 0}
#5  0x00007fe9ad41b6ba in start_thread (arg=0x7fe9adaccb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9adaccb00
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617890048, -4592431484730074264, 1, 140721305492959, 140641617890752, 0, 4582197928304057192, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
Thread 1 (Thread 0x7fe9adaacb00 (LWP 173)):
#0  0x00007fe9ad080428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pid = 169
        selftid = 173
#1  0x00007fe9ad08202a in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x66372d3030306464, sa_sigaction = 0x66372d3030306464}, sa_mask = {__val = {3487303855927015781, 2337418197644357680, 3472328296227680304, 3467824696768081952, 7076108730981878304, 3975887029563253302, 3775202183040165478, 8083248238249914416, 3472328296227680288, 2319406834570502192, 2314885530818453552, 2314885530818453536, 2314885530818453536, 7161132921294692384, 3702915907824803179, 3256155514953152354}}, sa_flags = 1667655223, sa_restorer = 0x65}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007fe9ad0c27ea in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe9ad1db2e0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
        ap = <error reading variable ap (Attempt to dereference a generic pointer.)>
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007fe9ad0cb539 in malloc_printerr (ar_ptr=0x7fe9a4000020, ptr=<optimized out>, str=0x7fe9ad1d8095 "corrupted double-linked list", action=3) at malloc.c:5004
        buf = "00007fe9a40012c0"
        cp = <optimized out>
        ar_ptr = 0x7fe9a4000020
        ptr = <optimized out>
        str = 0x7fe9ad1d8095 "corrupted double-linked list"
        action = 3
#4  _int_free (av=0x7fe9a4000020, p=<optimized out>, have_lock=0) at malloc.c:4012
        size = 624
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = 736
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x0
        locked = <optimized out>
#5  0x00007fe9ad0ce98c in __GI___libc_free (mem=mem@entry=0x7fe9a40010c0) at malloc.c:2966
        ar_ptr = <optimized out>
        p = 0x7fe9a40010b0
        hook = <optimized out>
#6  0x000055686a3bcbf2 in dtm_process_connect (dtms_cb=dtms_cb@entry=0x55686a5cb0a0 <_dtms_cb>,
        node_ip=node_ip@entry=0x7fe9adaabd90 "10.0.3.101", data=data@entry=0x7fe9adaabed0 "", len=len@entry=56) at src/dtm/dtmnd/dtm_node_sockets.c:1436
        foreign_port = 6700
        sock_desc = -1
        node = {cluster_id = 1, node_id = 131343, node_name = '\000' <repeats 255 times>, node_ip = "10.0.3.101", '\000' <repeats 35 times>, i_addr_family = DTM_IP_ADDR_TYPE_IPV4, comm_socket = 0, pat_nodeid = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, pat_comm_socket = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, pat_ip_address = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, comm_status = 0 '\000', node_lock = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, count = 0, tid = 0}, mbx = 0, mbx_fd = 0, msgs_hdr = 0x0, msgs_tail = 0x0, bytes_tb_read = 0, buff_total_len = 0, len_buff = "\000", num_by_read_for_len_buff = 0 '\000', buffer = 0x0}
        new_node = 0x7fe9a40010c0
        buffer = 0x7fe9adaabedc "10.0.3.101"
        mcast_flag = <optimized out>
        __FUNCTION__ = "dtm_process_connect"
#7  0x000055686a3b859a in node_discovery_process (arg=<optimized out>) at src/dtm/dtmnd/dtm_node.c:648
        new_sd = -1
        fd_check = 1
        __FUNCTION__ = "node_discovery_process"
        poll_ret = <optimized out>
        end_server = 0
        compress_array = <optimized out>
        close_conn = 0
        dtms_cb = 0x55686a5cb0a0 <_dtms_cb>
        current_size = <optimized out>
        i = <optimized out>
        j = <optimized out>
        inbuf = "\000:\000\001\000\002\001\017\000\032,\002\061\060.0.3.101", '\000' <repeats 1001 times>
        data1 = 0x7fe9adaabed2 ""
        recd_bytes = <optimized out>
        recd_buf_len = <optimized out>
        node_info_buffer_len = 20
        node_info_hrd = "\000\022V\022\064V\001\001\000\002\004\017\000\000\000\004PL-4", '\000' <repeats 250 times>
        node_ip = "10.0.3.101\000\000\000\000\000\000\020e<jhU\000\000\001", '\000' <repeats 20 times>
#8  0x00007fe9ad41b6ba in start_thread (arg=0x7fe9adaacb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9adaacb00
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617758976, -4592431484730074264, 1, 140721305493695, 140641617759680, 0, 4582197876764449640, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#9  0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

regards,
Anders Widell

On 08/14/2017 03:21 PM, Alex Jones wrote:

My feeling is that we should keep the two phases separate. If our new, scalable way doesn't make it in for a little while, at least we have this, and it is configurable.

Alex

On 08/14/2017 07:11 AM, Anders Widell wrote:

The idea is to still have an "initial discovery phase" where we broadcast messages at a high rate, and then to continue broadcasting at a lower rate after the initial discovery phase. If we have only one configuration option and enable this new continuous broadcast mode by setting that option to zero as you suggest, we would need to hard-code the length of the initial phase (to five seconds?) as well as the interval between broadcasts (30 seconds?) after the initial phase. Since we anyhow intend to implement a more sophisticated solution where only one of the nodes in the cluster continues to send broadcast messages, I think it could be OK to hard-code. What do you think, Alex?
regards,
Anders Widell

On 08/14/2017 07:56 AM, A V Mahesh wrote:
> Hi Alex Jones,
>
> I don't think you require an additional configuration option
> (CONT_BCAST_INT) for continuing dtm discovery;
> you can utilize DTM_INI_DIS_TIMEOUT_SECS=5 (dtm_discovery_timeout_secs)
> itself, like this. Just add an additional comment saying:
>
> "# Comment the next line (DTM_INI_DIS_TIMEOUT_SECS) to enable
> continuing dtm discovery"
>
> ==========================================================================================
> # dtm_discovery_timeout_secs: This is the time dtmc should be allowed to wait
> # for a service script finish executing. If the service is not finished
> # within this time dtmc will send a timeout message to DTMSv
> # Mandatory
> DTM_INI_DIS_TIMEOUT_SECS=5
> ==========================================================================================
>
> And while reading the configuration, if DTM_INI_DIS_TIMEOUT_SECS is ZERO,
> change the logic to continue dtm discovery.
>
> -AVM
>
> On 8/11/2017 8:51 PM, Alex Jones wrote:
>> If TCP closes a socket (due to TCP_USER_TIMEOUT or some other issue) because
>> of a network problem, and DTM declares that the node has exited the cluster,
>> the node never attempts to come back into the cluster if it does not reboot.
>>
>> Initial discovery broadcasts stop after 5 seconds.
>>
>> Discovery broadcasts should continue after the initial discovery phase at a
>> reduced rate.
>> ---
>>  src/dtm/dtmnd/dtm_cb.h          |  1 +
>>  src/dtm/dtmnd/dtm_main.c        | 56 +++++++++++++++++++++++++----------------
>>  src/dtm/dtmnd/dtm_read_config.c | 17 +++++++++++++
>>  src/dtm/dtmnd/dtmd.conf         |  8 ++++--
>>  4 files changed, 58 insertions(+), 24 deletions(-)
>>
>> diff --git a/src/dtm/dtmnd/dtm_cb.h b/src/dtm/dtmnd/dtm_cb.h
>> index f7c48c4..c5b5890 100644
>> --- a/src/dtm/dtmnd/dtm_cb.h
>> +++ b/src/dtm/dtmnd/dtm_cb.h
>> @@ -80,6 +80,7 @@ typedef struct dtm_internode_cb {
>>  	DTM_IP_ADDR_TYPE i_addr_family; /* Indicates V4 or V6 */
>>  	uint8_t mcast_flag; /* Indicates mcast */
>>  	int32_t initial_dis_timeout;
>> +	int32_t cont_bcast_int;
>>  	int64_t bcast_msg_freq;
>>  	NCS_PATRICIA_TREE nodeid_tree; /* NODE_DB information of Nodes */
>>  	NCS_PATRICIA_TREE comm_sock_tree; /* NODE_DB information of Nodes */
>> diff --git a/src/dtm/dtmnd/dtm_main.c b/src/dtm/dtmnd/dtm_main.c
>> index 724ce93..9f6d7cc 100644
>> --- a/src/dtm/dtmnd/dtm_main.c
>> +++ b/src/dtm/dtmnd/dtm_main.c
>> @@ -233,6 +233,30 @@ err:
>>  	return rc;
>>  }
>> +static uint32_t dtm_send_bcast_mcast(DTM_INTERNODE_CB *dtms_cb,
>> +				     void *send_bcast_buffer,
>> +				     size_t bcast_buf_len)
>> +{
>> +	uint32_t rc;
>> +
>> +	TRACE_ENTER();
>> +
>> +	if (dtms_cb->mcast_flag == true) {
>> +		rc = dtm_dgram_sendto_mcast(dtms_cb, send_bcast_buffer, bcast_buf_len);
>> +		if (NCSCC_RC_SUCCESS != rc) {
>> +			LOG_ER("DTM: dtm_dgram_sendto_mcast Failed rc : %d \n", rc);
>> +		}
>> +	} else {
>> +		rc = dtm_dgram_sendto_bcast(dtms_cb, send_bcast_buffer, bcast_buf_len);
>> +		if (NCSCC_RC_SUCCESS != rc) {
>> +			LOG_ER("DTM: dtm_dgram_sendto_bcast Failed rc : %d \n", rc);
>> +		}
>> +	}
>> +
>> +	TRACE_LEAVE();
>> +	return rc;
>> +}
>> +
>>  /**
>>   * DTM process main function
>>   *
>> @@ -357,26 +381,7 @@ int main(int argc, char *argv[])
>>  		/* Broadcast msg string in datagram to clients every 250 m
>>  		 * seconds */
>> -		if (dtms_cb->mcast_flag == true) {
>> -
>> -			rc = dtm_dgram_sendto_mcast(dtms_cb, send_bcast_buffer,
>> -						    bcast_buf_len);
>> -			if (NCSCC_RC_SUCCESS != rc) {
>> -				LOG_ER(
>> -				    "DTM: dtm_dgram_sendto_mcast Failed rc : %d \n",
>> -				    rc);
>> -			}
>> -
>> -		} else {
>> -
>> -			rc = dtm_dgram_sendto_bcast(dtms_cb, send_bcast_buffer,
>> -						    bcast_buf_len);
>> -			if (NCSCC_RC_SUCCESS != rc) {
>> -				LOG_ER(
>> -				    "DTM: dtm_dgram_sendto_bcast Failed rc : %d \n",
>> -				    rc);
>> -			}
>> -		}
>> +		dtm_send_bcast_mcast(dtms_cb, send_bcast_buffer, bcast_buf_len);
>>  		dis_elapsed_time_usec =
>>  		    dis_elapsed_time_usec + (dtms_cb->bcast_msg_freq * 1000);
>> @@ -387,8 +392,15 @@ int main(int argc, char *argv[])
>>  	/*************************************************************/
>>  	initial_discovery_phase = false;
>>  	while (1) {
>> -		m_NCS_TASK_SLEEP(0xfffffff0);
>> -		/* m_NCS_TASK_SLEEP(30000); */
>> +		if (dtms_cb->cont_bcast_int) {
>> +			m_NCS_TASK_SLEEP(dtms_cb->cont_bcast_int * 1000);
>> +			/* periodically send a broadcast */
>> +			dtm_send_bcast_mcast(dtms_cb, send_bcast_buffer,
>> +					     bcast_buf_len);
>> +		} else {
>> +			m_NCS_TASK_SLEEP(0xfffffff0);
>> +		}
>>  	}
>>  done1:
>>  	LOG_ER("DTM : dtm_destroy_service_discovery_task exiting...");
>> diff --git a/src/dtm/dtmnd/dtm_read_config.c b/src/dtm/dtmnd/dtm_read_config.c
>> index 85f34f4..0db21e8 100644
>> --- a/src/dtm/dtmnd/dtm_read_config.c
>> +++ b/src/dtm/dtmnd/dtm_read_config.c
>> @@ -39,6 +39,7 @@ extern uint32_t intranode_max_processes;
>>  #define KEEPALIVE_PROBES 9
>>  #define DIS_TIME_OUT 5
>>  #define BCAST_FRE 250
>> +#define CONT_BCAST_INT 30
>>  #define USER_TIMEOUT 1500 // 1.5 sec to match other transport
>>  const char *IN6ADDR_LINK_LOCAL =
>> @@ -108,6 +109,8 @@ void dtm_print_config(DTM_INTERNODE_CB *config)
>>  	TRACE(" %d", config->initial_dis_timeout);
>>  	TRACE(" DTM_BCAST_FRE_MSECS: ");
>>  	TRACE(" %" PRId64 "", config->bcast_msg_freq);
>> +	TRACE(" DTM_CONT_BCAST_INT: ");
>> +	TRACE(" %d", config->cont_bcast_int);
>>  	TRACE(" DTM_SOCK_SND_BUF_SIZE: ");
>>  	TRACE(" %d", config->sock_sndbuf_size);
>>  	TRACE(" DTM_SOCK_RCV_BUF_SIZE: ");
>> @@ -277,6 +280,7 @@ int dtm_read_config(DTM_INTERNODE_CB *config, char *dtm_config_file)
>>  	config->comm_keepidle_time = USER_TIMEOUT;
>>  	config->i_addr_family = DTM_IP_ADDR_TYPE_IPV4;
>>  	config->bcast_msg_freq = BCAST_FRE;
>> +	config->cont_bcast_int = CONT_BCAST_INT;
>>  	config->initial_dis_timeout = DIS_TIME_OUT;
>>  	config->sock_sndbuf_size = 0;
>>  	config->sock_rcvbuf_size = 0;
>> @@ -431,6 +435,19 @@ int dtm_read_config(DTM_INTERNODE_CB *config, char *dtm_config_file)
>>  			tag = 0;
>>  			tag_len = 0;
>> +
>> +		}
>> +		if (strncmp(line, "DTM_CONTINUOUS_BCAST_INT=",
>> +			    strlen("DTM_CONTINUOUS_BCAST_INT=")) == 0) {
>> +			tag_len = strlen("DTM_CONTINUOUS_BCAST_INT=");
>> +			config->cont_bcast_int = atoi(&line[tag_len]);
>> +			if (config->cont_bcast_int < 0) {
>> +				LOG_ER("DTM:cont_bcast_int must be 0 or greater");
>> +				fclose(dtm_conf_file);
>> +				return -1;
>> +			}
>> +
>> +			tag = 0;
>> +			tag_len = 0;
>>  		}
>>  		if (strncmp(line, "DTM_BCAST_FRE_MSECS=",
>>  			    strlen("DTM_BCAST_FRE_MSECS=")) == 0) {
>> diff --git a/src/dtm/dtmnd/dtmd.conf b/src/dtm/dtmnd/dtmd.conf
>> index c93b340..51bebb2 100644
>> --- a/src/dtm/dtmnd/dtmd.conf
>> +++ b/src/dtm/dtmnd/dtmd.conf
>> @@ -38,10 +38,14 @@ DTM_UDP_BCAST_SND_PORT=6800
>>  DTM_UDP_BCAST_REV_PORT=6900
>>  #
>> -# udp_broadcast_port: The UDP port that the DTMSv listens on
>> -# Mandatory
>> +# bcast_freq: frequencey (in milliseconds) between broadcasts during initial
>> +# discovery
>>  DTM_BCAST_FRE_MSECS=250
>> +# cont_bcast_int: interval (in seconds) between broadcasts after initial
>> +# discovery has completed
>> +DTM_CONTINUOUS_BCAST_INT=30
>> +
>>  # dtm_discovery_timeout_secs: This is the time dtmc should be allowed to wait
>>  # for a service script finish executing. If the service is not finished
>>  # within this time dtmc will send a timeout message to DTMSv
------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel