Hi Anders,

I tried to reproduce this quickly, but was not able to.
It looks to me like an issue related to the boot order of the network interfaces and OpenSAF start-up.
I will keep trying to reproduce the problem and update you.

================================================================================
On 8/16/2017 2:02 PM, Anders Widell wrote:
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.104, node_id: 0
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: accept() failed
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: NO Lost contact with 'PL-4'
> 2017-08-15 10:41:21.319 SC-1 kernel:[6471411.217909] IPv6: ADDRCONF(NETDEV_UP): veth1BBPKW: link is not ready
================================================================================

-AVM


On 8/16/2017 2:02 PM, Anders Widell wrote:
Here are the logs from SC-1. It seems to be starting at the same time as PL-4:

2017-08-15 10:41:20.262 SC-1 opensafd: Starting OpenSAF Services(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) (Using TCP)
2017-08-15 10:41:20.263 SC-1 opensafd[113]: /etc/init.d/opensafd: 113: /etc/init.d/opensafd: cannot create /proc/sys/kernel/core_pattern: Read-only file system
2017-08-15 10:41:20.289 SC-1 osafdtmd[156]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
2017-08-15 10:41:20.291 SC-1 osafdtmd[156]: Started
2017-08-15 10:41:20.296 SC-1 osaftransportd[166]: Started
2017-08-15 10:41:20.307 SC-1 osafclmna[170]: Started
2017-08-15 10:41:20.320 SC-1 osafrded[179]: Started
2017-08-15 10:41:20.337 SC-1 osaffmd[188]: Started
2017-08-15 10:41:20.338 SC-1 osaffmd[188]: NO Remote fencing is disabled
2017-08-15 10:41:20.354 SC-1 osafimmd[198]: Started
2017-08-15 10:41:20.370 SC-1 osafimmnd[209]: Started
2017-08-15 10:41:20.371 SC-1 osafimmnd[209]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
2017-08-15 10:41:20.396 SC-1 opensafd[145]: NO Monitoring of TRANSPORT started
2017-08-15 10:41:20.507 SC-1 osafclmna[170]: NO Starting to promote this node to a system controller
2017-08-15 10:41:20.507 SC-1 osafrded[179]: NO Requesting ACTIVE role
2017-08-15 10:41:20.507 SC-1 osafrded[179]: NO RDE role set to Undefined
2017-08-15 10:41:20.542 SC-1 osafdtmd[156]: NO Established contact with 'PL-4'
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.104, node_id: 0
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: accept() failed
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: NO Lost contact with 'PL-4'
2017-08-15 10:41:21.319 SC-1 kernel:[6471411.217909] IPv6: ADDRCONF(NETDEV_UP): veth1BBPKW: link is not ready
2017-08-15 10:41:22.509 SC-1 osafrded[179]: NO Running '/usr/local/lib/opensaf/opensaf_sc_active' with 0 argument(s)
2017-08-15 10:41:22.534 SC-1 osafrded[179]: NO Switched to ACTIVE from Undefined
2017-08-15 10:41:22.534 SC-1 osaffmd[188]: NO Starting activation supervision: 300000ms
2017-08-15 10:41:22.535 SC-1 osafimmnd[209]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.535 SC-1 osafrded[179]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.535 SC-1 osafclmna[170]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.536 SC-1 osaftransportd[166]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: osafdtmd Process down, Rebooting the node, OwnNodeId = 0, SupervisionTime = 60
2017-08-15 10:41:22.546 SC-1 opensaf_reboot: Rebooting local node; timeout=60


On 08/16/2017 10:22 AM, Anders Widell wrote:
It is quite easy to reproduce the "dtm_node_add failed" log message, which leads to a reboot. I have only seen the core dump once, though.

Here is the log from the time when I got the core dump:

2017-08-15 10:41:15.218 PL-4 systemd[1]: Started LSB: ping controllers.
2017-08-15 10:41:15.219 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/opensafd.service: Operation not permitted
2017-08-15 10:41:15.219 PL-4 systemd[1]: Starting OpenSAF daemon...
2017-08-15 10:41:15.219 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/plsync.service: Operation not permitted
2017-08-15 10:41:15.253 PL-4 opensafd: Starting OpenSAF Services(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) (Using TCP)
2017-08-15 10:41:15.254 PL-4 opensafd[134]: /etc/init.d/opensafd: 113: /etc/init.d/opensafd: cannot create /proc/sys/kernel/core_pattern: Read-only file system
2017-08-15 10:41:15.279 PL-4 osafdtmd[169]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
2017-08-15 10:41:15.280 PL-4 osafdtmd[169]: Started
2017-08-15 10:41:15.284 PL-4 osaftransportd[177]: Started
2017-08-15 10:41:15.297 PL-4 osafclmna[183]: Started
2017-08-15 10:41:15.312 PL-4 osafimmnd[193]: Started
2017-08-15 10:41:15.314 PL-4 osafimmnd[193]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
2017-08-15 10:41:15.383 PL-4 opensafd[158]: NO Monitoring of TRANSPORT started
2017-08-15 10:41:15.384 PL-4 opensafd[158]: NO Monitoring of CLMNA started
2017-08-15 10:41:15.498 PL-4 osafclmna[183]: NO Starting to promote this node to a system controller
2017-08-15 10:41:15.532 PL-4 osafdtmd[169]: NO Established contact with 'PL-3'
2017-08-15 10:41:15.533 PL-4 osafdtmd[169]: NO Established contact with 'SC-2'
2017-08-15 10:41:15.534 PL-4 osafdtmd[169]: NO Established contact with 'SC-1'
2017-08-15 10:41:15.535 PL-4 osafimmnd[193]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
2017-08-15 10:41:15.535 PL-4 osafclmna[183]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
2017-08-15 10:41:15.537 PL-4 osafimmnd[193]: NO Fevs count adjusted to 2356 preLoadPid: 0
2017-08-15 10:41:15.616 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2017-08-15 10:41:15.718 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2017-08-15 10:41:15.819 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2017-08-15 10:41:15.819 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_ISOLATED
2017-08-15 10:41:16.252 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
2017-08-15 10:41:16.323 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
2017-08-15 10:41:16.593 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2714
2017-08-15 10:41:16.593 PL-4 osafimmnd[193]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE
2017-08-15 10:41:16.594 PL-4 osafimmnd[193]: WA IMM Access Control mode is DISABLED!
2017-08-15 10:41:16.594 PL-4 osafimmnd[193]: NO Epoch set to 6 in ImmModel
2017-08-15 10:41:16.656 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM_SERVER_READY
2017-08-15 10:41:16.656 PL-4 opensafd[158]: NO Monitoring of IMMND started
2017-08-15 10:41:16.656 PL-4 osafimmnd[193]: NO ImmModel received scAbsenceAllowed 0
2017-08-15 10:41:16.672 PL-4 osafamfnd[204]: Started
2017-08-15 10:41:16.677 PL-4 osafamfnd[204]: NO Sending node up due to NCSMDS_UP
2017-08-15 10:41:16.682 PL-4 osafamfnd[204]: NO saAmfCtDefQuiescingCompleteTimeout for 'safVersion=5.5.5,safCompType=AmfDemo2' initialized with saAmfCtDefCallbackTimeout
2017-08-15 10:41:16.698 PL-4 osafamfnd[204]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State UNINSTANTIATED => INSTANTIATING
2017-08-15 10:41:16.734 PL-4 osafamfwd[215]: Started
2017-08-15 10:41:16.776 PL-4 osafckptnd[225]: Started
2017-08-15 10:41:16.847 PL-4 osafsmfnd[249]: Started
2017-08-15 10:41:16.885 PL-4 osafamfnd[204]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING => INSTANTIATED
2017-08-15 10:41:16.894 PL-4 osafamfnd[204]: NO Assigning 'safSi=NoRed4,safApp=OpenSAF' ACTIVE to 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
2017-08-15 10:41:16.894 PL-4 osafamfnd[204]: NO Assigned 'safSi=NoRed4,safApp=OpenSAF' ACTIVE to 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
2017-08-15 10:41:16.908 PL-4 opensafd: OpenSAF(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) services successfully started
2017-08-15 10:41:16.912 PL-4 opensafd[134]: Starting OpenSAF Services (Using TCP): *
2017-08-15 10:41:16.914 PL-4 systemd[1]: Started OpenSAF daemon.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Reached target Multi-User System.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Reached target Graphical Interface.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/systemd-update-utmp-runlevel.service: Operation not permitted
2017-08-15 10:41:17.011 PL-4 systemd[1]: Starting Update UTMP about System Runlevel Changes...
2017-08-15 10:41:17.017 PL-4 systemd[1]: Started Update UTMP about System Runlevel Changes.
2017-08-15 10:41:17.017 PL-4 systemd[1]: Startup finished in 6.981s.
2017-08-15 10:41:17.959 PL-4 osafimmnd[193]: NO Implementer disconnected 23 <0, 2010f> (@OpenSafImmReplicatorB)
2017-08-15 10:41:17.964 PL-4 osafimmnd[193]: NO Global discard node received for nodeId:2010f pid:209
2017-08-15 10:41:17.964 PL-4 osafimmnd[193]: NO Implementer disconnected 24 <0, 2010f(down)> (@safAmfService2010f)
2017-08-15 10:41:19.208 PL-4 osafdtmd[169]: NO Lost contact with 'SC-1'
2017-08-15 10:41:20.542 PL-4 osafdtmd[169]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.101, node_id: 131343
2017-08-15 10:41:20.548 PL-4 osafckptnd[225]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.549 PL-4 osafsmfnd[249]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.550 PL-4 osafamfnd[204]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.551 PL-4 osafamfwd[215]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.552 PL-4 osafimmnd[193]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.552 PL-4 osafclmna[183]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.553 PL-4 osaftransportd[177]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: osafdtmd Process down, Rebooting the node, OwnNodeId = 0, SupervisionTime = 60
2017-08-15 10:41:20.570 PL-4 opensaf_reboot: Rebooting local node; timeout=60


On 08/16/2017 10:11 AM, A V Mahesh wrote:
Hi Anders,

Can you share the logs?

-AVM


On 8/16/2017 1:33 PM, Anders Widell wrote:
The problem seems to happen with a low probability when I reboot a node. Presumably, it can happen when two nodes try to establish TCP connections towards each other at approximately the same time.

regards,

Anders Widell


On 08/16/2017 06:21 AM, A V Mahesh wrote:
Hi Anders ,

On 8/15/2017 6:13 PM, Anders Widell wrote:
Is it a pure performance optimization, or is there some potential problem that can happen if two nodes discover each other at the same time and establish TCP connections in both directions? I am seeing error log messages that dtm_node_add() has failed.
It is both. When all nodes join the cluster simultaneously, the node with the higher node ID is designated to establish the TCP connection (we are using a single connection for both tx and rx); otherwise there would be a race.

In which case are you getting the error?

-AVM

On 8/15/2017 6:13 PM, Anders Widell wrote:
A question to Mahesh: what is the reason behind the following few lines of code:

    if (initial_discovery_phase == true) {
        if (node.node_id < dtms_cb->node_id) {
            TRACE(
"DTM: received node_id is less than local node_id dropping message cluster_id: %d node_id: %u",
                node.cluster_id, node.node_id);
            return sock_desc;
        }
    }

Is it a pure performance optimization, or is there some potential problem that can happen if two nodes discover each other at the same time and establish TCP connections in both directions? I am seeing error log messages that dtm_node_add() has failed.

regards,
Anders Widell

On 08/15/2017 11:09 AM, Anders Widell wrote:
Yes I agree, I think it is more flexible to have a new configuration option for this.

Some comments on the patch:

1) The new configuration option should specify broadcast interval time in the unit of milliseconds rather than seconds, to make it possible to configure sub-second intervals.

2) I prefer to comment out the new option from the config file. A commented out option means that the user wishes to use the default value from the C code - which can potentially change after upgrading to a new version of OpenSAF. If the user wishes to explicitly set the configuration option then he can uncomment the line.
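As an illustration of that convention (a sketch only, reusing the option name from the patch; the exact unit and default are still under discussion in point 1), the shipped dtmd.conf entry could look like:

```
# cont_bcast_int: interval between broadcasts after initial discovery has
# completed. When commented out, the compiled-in default is used, which may
# change in a future OpenSAF version. Uncomment to pin an explicit value.
#DTM_CONTINUOUS_BCAST_INT=30
```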

3) When testing this patch I got this rather unsettling core dump, caused by "corrupted double-linked list" in malloc(). I will take a closer look at this to see if I can understand what is happening. Could you also take a look? Maybe run osafdtmd with valgrind:

Stack trace of core.1502786480.osafdtmd.169.PL-4
[New LWP 173]
[New LWP 171]
[New LWP 169]
[New LWP 174]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/lib/opensaf/osafdtmd'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fe9ad080428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
[Current thread is 1 (Thread 0x7fe9adaacb00 (LWP 173))]

Thread 4 (Thread 0x7fe9ada8cb00 (LWP 174)):
#0 0x00007fe9ad145b5d in poll () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1 0x000055686a3b53ba in poll (__timeout=20000, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2  dtm_intranode_processing () at src/dtm/dtmnd/dtm_intra.c:713
        poll_ret_val = 0
        j = <optimized out>
        __FUNCTION__ = "dtm_intranode_processing"
#3 0x00007fe9ad41b6ba in start_thread (arg=0x7fe9ada8cb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9ada8cb00
        now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617627904, -4592431484730074264, 1, 140721305493231, 140641617628608, 0, 4582197893944318824, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#4 0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 3 (Thread 0x7fe9adacf740 (LWP 169)):
#0 __clock_nanosleep (clock_id=clock_id@entry=1, flags=flags@entry=1, req=req@entry=0x7ffc3b6d2130, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
        oldstate = 0
        r = <optimized out>
        rem = 0x0
        req = 0x7ffc3b6d2130
        flags = 1
        clock_id = <optimized out>
#1 0x00007fe9ad65e034 in osaf_nanosleep (sleep_duration=sleep_duration@entry=0x7ffc3b6d2170) at src/base/osaf_time.c:44
        wakeup_time = {tv_sec = 6471440, tv_nsec = 210765046}
        retval = <optimized out>
#2 0x00007fe9ad65af67 in ncs_os_task (task=task@entry=0x7ffc3b6d2400, request=request@entry=NCS_OS_TASK_SLEEP) at src/base/os_defs.c:333
        ts = {tv_sec = 30, tv_nsec = 0}
        __PRETTY_FUNCTION__ = "ncs_os_task"
#3 0x00007fe9ad665143 in ncs_task_sleep (delay_in_ms=<optimized out>) at src/base/sysf_tsk.c:141
        task = {info = {create = {i_entry_point = 0x7530, i_name = 0x0, i_priority = 0, i_stack_nbytes = 0, i_policy = 0, i_ep_arg = 0x0, o_handle = 0x0}, release = {i_handle = 0x7530}, detach = {i_handle = 0x7530}, start = {i_handle = 0x7530}, stop = {i_handle = 0x7530}, sleep = {i_delay_in_ms = 30000}, current_handle = {o_handle = 0x7530}}}
#4 0x000055686a3b3c65 in main (argc=<optimized out>, argv=0x7ffc3b6d2668) at src/dtm/dtmnd/dtm_main.c:396
        rc = <optimized out>
send_bcast_buffer = "\000:\000\001\000\002\004\017\000\032,\002\061\060.0.3.104", '\000' <repeats 232 times>
        bcast_buf_len = 58
        dis_time_out_usec = 5000000
        dis_elapsed_time_usec = 5250000
        dtms_cb = 0x55686a5cb0a0 <_dtms_cb>
        __FUNCTION__ = "main"

Thread 2 (Thread 0x7fe9adaccb00 (LWP 171)):
#0 0x00007fe9ad145b5d in poll () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1 0x00007fe9ad65cda0 in poll (__timeout=-1, __nfds=1, __fds=0x7fe9adacc260) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2 osaf_poll_no_timeout (io_fds=0x7fe9adacc260, i_nfds=1) at src/base/osaf_poll.c:31
No locals.
#3 0x00007fe9ad65cfe5 in osaf_ppoll (io_fds=io_fds@entry=0x7fe9adacc260, i_nfds=i_nfds@entry=1, i_timeout_ts=0x0, i_sigmask=i_sigmask@entry=0x0) at src/base/osaf_poll.c:82
        start_time = {tv_sec = 0, tv_nsec = 1}
        time_left_ts = <optimized out>
        result = -1381186976
#4 0x00007fe9ad66416f in ncs_tmr_wait () at src/base/sysf_tmr.c:463
        rc = <optimized out>
        inds_rmvd = <optimized out>
        next_delay = <optimized out>
        tv = <optimized out>
        ts_current = {tv_sec = 6471404, tv_nsec = 949388789}
        ts = {tv_sec = 16777215, tv_nsec = 0}
        set = {fd = 6, events = 1, revents = 0}
#5 0x00007fe9ad41b6ba in start_thread (arg=0x7fe9adaccb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9adaccb00
        now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617890048, -4592431484730074264, 1, 140721305492959, 140641617890752, 0, 4582197928304057192, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6 0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 1 (Thread 0x7fe9adaacb00 (LWP 173)):
#0 0x00007fe9ad080428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pid = 169
        selftid = 173
#1  0x00007fe9ad08202a in __GI_abort () at abort.c:89
        save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x66372d3030306464, sa_sigaction = 0x66372d3030306464}, sa_mask = {__val = {3487303855927015781, 2337418197644357680, 3472328296227680304, 3467824696768081952, 7076108730981878304, 3975887029563253302, 3775202183040165478, 8083248238249914416, 3472328296227680288, 2319406834570502192, 2314885530818453552, 2314885530818453536, 2314885530818453536, 7161132921294692384, 3702915907824803179, 3256155514953152354}}, sa_flags = 1667655223, sa_restorer = 0x65}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007fe9ad0c27ea in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe9ad1db2e0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175 ap = <error reading variable ap (Attempt to dereference a generic pointer.)>
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3 0x00007fe9ad0cb539 in malloc_printerr (ar_ptr=0x7fe9a4000020, ptr=<optimized out>, str=0x7fe9ad1d8095 "corrupted double-linked list", action=3) at malloc.c:5004
        buf = "00007fe9a40012c0"
        cp = <optimized out>
        ar_ptr = 0x7fe9a4000020
        ptr = <optimized out>
        str = 0x7fe9ad1d8095 "corrupted double-linked list"
        action = 3
#4 _int_free (av=0x7fe9a4000020, p=<optimized out>, have_lock=0) at malloc.c:4012
        size = 624
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = 736
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x0
        locked = <optimized out>
#5 0x00007fe9ad0ce98c in __GI___libc_free (mem=mem@entry=0x7fe9a40010c0) at malloc.c:2966
        ar_ptr = <optimized out>
        p = 0x7fe9a40010b0
        hook = <optimized out>
#6 0x000055686a3bcbf2 in dtm_process_connect (dtms_cb=dtms_cb@entry=0x55686a5cb0a0 <_dtms_cb>, node_ip=node_ip@entry=0x7fe9adaabd90 "10.0.3.101", data=data@entry=0x7fe9adaabed0 "", len=len@entry=56) at src/dtm/dtmnd/dtm_node_sockets.c:1436
        foreign_port = 6700
        sock_desc = -1
node = {cluster_id = 1, node_id = 131343, node_name = '\000' <repeats 255 times>, node_ip = "10.0.3.101", '\000' <repeats 35 times>, i_addr_family = DTM_IP_ADDR_TYPE_IPV4, comm_socket = 0, pat_nodeid = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, pat_comm_socket = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, pat_ip_address = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, comm_status = 0 '\000', node_lock = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, count = 0, tid = 0}, mbx = 0, mbx_fd = 0, msgs_hdr = 0x0, msgs_tail = 0x0, bytes_tb_read = 0, buff_total_len = 0, len_buff = "\000", num_by_read_for_len_buff = 0 '\000', buffer = 0x0}
        new_node = 0x7fe9a40010c0
        buffer = 0x7fe9adaabedc "10.0.3.101"
        mcast_flag = <optimized out>
        __FUNCTION__ = "dtm_process_connect"
#7 0x000055686a3b859a in node_discovery_process (arg=<optimized out>) at src/dtm/dtmnd/dtm_node.c:648
        new_sd = -1
        fd_check = 1
        __FUNCTION__ = "node_discovery_process"
        poll_ret = <optimized out>
        end_server = 0
        compress_array = <optimized out>
        close_conn = 0
        dtms_cb = 0x55686a5cb0a0 <_dtms_cb>
        current_size = <optimized out>
        i = <optimized out>
        j = <optimized out>
inbuf = "\000:\000\001\000\002\001\017\000\032,\002\061\060.0.3.101", '\000' <repeats 1001 times>
        data1 = 0x7fe9adaabed2 ""
        recd_bytes = <optimized out>
        recd_buf_len = <optimized out>
        node_info_buffer_len = 20
node_info_hrd = "\000\022V\022\064V\001\001\000\002\004\017\000\000\000\004PL-4", '\000' <repeats 250 times>
        node_ip = "10.0.3.101\000\000\000\000\000\000\020e<jhU\000\000\001", '\000' <repeats 20 times>
#8 0x00007fe9ad41b6ba in start_thread (arg=0x7fe9adaacb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9adaacb00
        now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617758976, -4592431484730074264, 1, 140721305493695, 140641617759680, 0, 4582197876764449640, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#9 0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

regards,

Anders Widell


On 08/14/2017 03:21 PM, Alex Jones wrote:

My feeling is that we should keep the two phases separate. If our new, scalable way doesn't make it in for a little while, at least we have this, and it is configurable.

Alex


On 08/14/2017 07:11 AM, Anders Widell wrote:
------------------------------------------------------------------------
NOTICE: This email was received from an EXTERNAL sender
------------------------------------------------------------------------

The idea is to still have an "initial discovery phase" where we broadcast messages at a high rate, and then continue broadcasting at a lower rate after the initial discovery phase. If we have only one configuration option and enable this new continuous broadcast mode by setting that option to zero as you suggest, we would need to hard-code the length of the initial phase (to five seconds?) as well as the interval between broadcasts (30 seconds?) after the initial phase. Since we anyhow intend to implement a more sophisticated solution where only one of the nodes in the cluster continues to send broadcast messages, I think it could be OK to hard-code. What do you think, Alex?

regards,

Anders Widell


On 08/14/2017 07:56 AM, A V Mahesh wrote:
> Hi Alex Jones,
>
> I don't think you need an additional configuration option
> (CONT_BCAST_INT) for continuing dtm discovery;
> you can reuse the existing DTM_INI_DIS_TIMEOUT_SECS=5
> (dtm_discovery_timeout_secs) itself, like this:
> Just add an additional comment saying:
>
> "# Comment the next line (DTM_INI_DIS_TIMEOUT_SECS) to enable
> continuing dtm discovery"
>
> ==========================================================================================
>
> # dtm_discovery_timeout_secs: This is the time dtmc should be allowed
> to wait
> # for a service script finish executing. If the service is not finished
> # within this time dtmc will send a timeout message to DTMSv
> # Mandatory
> DTM_INI_DIS_TIMEOUT_SECS=5
> ==========================================================================================
>
>
> And when reading the configuration, if DTM_INI_DIS_TIMEOUT_SECS is ZERO,
> change the logic to continue dtm discovery.
>
> -AVM
>
> On 8/11/2017 8:51 PM, Alex Jones wrote:
>> If TCP closes a socket (due to TCP_USER_TIMEOUT or some other issue)
>> because
>> of a network problem, and DTM declares that the node has exited the
>> cluster,
>> the node never attempts to come back into the cluster if it does not
>> reboot.
>>
>> Initial discovery broadcasts stop after 5 seconds.
>>
>> Discovery broadcasts should continue after the initial discovery
>> phase at a
>> reduced rate.
>> ---
>> src/dtm/dtmnd/dtm_cb.h | 1 +
>> src/dtm/dtmnd/dtm_main.c | 56
>> +++++++++++++++++++++++++----------------
>> src/dtm/dtmnd/dtm_read_config.c | 17 +++++++++++++
>> src/dtm/dtmnd/dtmd.conf | 8 ++++--
>> 4 files changed, 58 insertions(+), 24 deletions(-)
>>
>> diff --git a/src/dtm/dtmnd/dtm_cb.h b/src/dtm/dtmnd/dtm_cb.h
>> index f7c48c4..c5b5890 100644
>> --- a/src/dtm/dtmnd/dtm_cb.h
>> +++ b/src/dtm/dtmnd/dtm_cb.h
>> @@ -80,6 +80,7 @@ typedef struct dtm_internode_cb {
>> DTM_IP_ADDR_TYPE i_addr_family; /* Indicates V4 or V6 */
>> uint8_t mcast_flag; /* Indicates mcast */
>> int32_t initial_dis_timeout;
>> + int32_t cont_bcast_int;
>> int64_t bcast_msg_freq;
>> NCS_PATRICIA_TREE nodeid_tree; /* NODE_DB information of Nodes */
>> NCS_PATRICIA_TREE comm_sock_tree; /* NODE_DB information of Nodes */
>> diff --git a/src/dtm/dtmnd/dtm_main.c b/src/dtm/dtmnd/dtm_main.c
>> index 724ce93..9f6d7cc 100644
>> --- a/src/dtm/dtmnd/dtm_main.c
>> +++ b/src/dtm/dtmnd/dtm_main.c
>> @@ -233,6 +233,30 @@ err:
>> return rc;
>> }
>> +static uint32_t dtm_send_bcast_mcast(DTM_INTERNODE_CB *dtms_cb,
>> + void *send_bcast_buffer,
>> + size_t bcast_buf_len)
>> +{
>> + uint32_t rc;
>> +
>> + TRACE_ENTER();
>> +
>> + if (dtms_cb->mcast_flag == true) {
>> + rc = dtm_dgram_sendto_mcast(dtms_cb, send_bcast_buffer,
>> bcast_buf_len);
>> + if (NCSCC_RC_SUCCESS != rc) {
>> + LOG_ER("DTM: dtm_dgram_sendto_mcast Failed rc : %d \n",
>> rc);
>> + }
>> + } else {
>> + rc = dtm_dgram_sendto_bcast(dtms_cb, send_bcast_buffer,
>> bcast_buf_len);
>> + if (NCSCC_RC_SUCCESS != rc) {
>> + LOG_ER("DTM: dtm_dgram_sendto_bcast Failed rc : %d \n",
>> rc);
>> + }
>> + }
>> +
>> + TRACE_LEAVE();
>> + return rc;
>> +}
>> +
>> /**
>> * DTM process main function
>> *
>> @@ -357,26 +381,7 @@ int main(int argc, char *argv[])
>> /* Broadcast msg string in datagram to clients every 250 m
>> * seconds */
>> - if (dtms_cb->mcast_flag == true) {
>> -
>> - rc = dtm_dgram_sendto_mcast(dtms_cb, send_bcast_buffer,
>> - bcast_buf_len);
>> - if (NCSCC_RC_SUCCESS != rc) {
>> - LOG_ER(
>> - "DTM: dtm_dgram_sendto_mcast Failed rc : %d \n",
>> - rc);
>> - }
>> -
>> - } else {
>> -
>> - rc = dtm_dgram_sendto_bcast(dtms_cb, send_bcast_buffer,
>> - bcast_buf_len);
>> - if (NCSCC_RC_SUCCESS != rc) {
>> - LOG_ER(
>> - "DTM: dtm_dgram_sendto_bcast Failed rc : %d \n",
>> - rc);
>> - }
>> - }
>> + dtm_send_bcast_mcast(dtms_cb, send_bcast_buffer,
>> bcast_buf_len);
>> dis_elapsed_time_usec =
>> dis_elapsed_time_usec + (dtms_cb->bcast_msg_freq * 1000);
>> @@ -387,8 +392,15 @@ int main(int argc, char *argv[])
>> /*************************************************************/
>> initial_discovery_phase = false;
>> while (1) {
>> - m_NCS_TASK_SLEEP(0xfffffff0);
>> - /* m_NCS_TASK_SLEEP(30000); */
>> + if (dtms_cb->cont_bcast_int) {
>> + m_NCS_TASK_SLEEP(dtms_cb->cont_bcast_int * 1000);
>> + /* periodically send a broadcast */
>> + dtm_send_bcast_mcast(dtms_cb,
>> + send_bcast_buffer,
>> + bcast_buf_len);
>> + } else {
>> + m_NCS_TASK_SLEEP(0xfffffff0);
>> + }
>> }
>> done1:
>> LOG_ER("DTM : dtm_destroy_service_discovery_task exiting...");
>> diff --git a/src/dtm/dtmnd/dtm_read_config.c
>> b/src/dtm/dtmnd/dtm_read_config.c
>> index 85f34f4..0db21e8 100644
>> --- a/src/dtm/dtmnd/dtm_read_config.c
>> +++ b/src/dtm/dtmnd/dtm_read_config.c
>> @@ -39,6 +39,7 @@ extern uint32_t intranode_max_processes;
>> #define KEEPALIVE_PROBES 9
>> #define DIS_TIME_OUT 5
>> #define BCAST_FRE 250
>> +#define CONT_BCAST_INT 30
>> #define USER_TIMEOUT 1500 // 1.5 sec to match other transport
>> const char *IN6ADDR_LINK_LOCAL =
>> @@ -108,6 +109,8 @@ void dtm_print_config(DTM_INTERNODE_CB *config)
>> TRACE(" %d", config->initial_dis_timeout);
>> TRACE(" DTM_BCAST_FRE_MSECS: ");
>> TRACE(" %" PRId64 "", config->bcast_msg_freq);
>> + TRACE(" DTM_CONT_BCAST_INT: ");
>> + TRACE(" %d", config->cont_bcast_int);
>> TRACE(" DTM_SOCK_SND_BUF_SIZE: ");
>> TRACE(" %d", config->sock_sndbuf_size);
>> TRACE(" DTM_SOCK_RCV_BUF_SIZE: ");
>> @@ -277,6 +280,7 @@ int dtm_read_config(DTM_INTERNODE_CB *config,
>> char *dtm_config_file)
>> config->comm_keepidle_time = USER_TIMEOUT;
>> config->i_addr_family = DTM_IP_ADDR_TYPE_IPV4;
>> config->bcast_msg_freq = BCAST_FRE;
>> + config->cont_bcast_int = CONT_BCAST_INT;
>> config->initial_dis_timeout = DIS_TIME_OUT;
>> config->sock_sndbuf_size = 0;
>> config->sock_rcvbuf_size = 0;
>> @@ -431,6 +435,19 @@ int dtm_read_config(DTM_INTERNODE_CB *config,
>> char *dtm_config_file)
>> tag = 0;
>> tag_len = 0;
>> +
>> + }
>> + if (strncmp(line, "DTM_CONTINUOUS_BCAST_INT=",
>> strlen("DTM_CONTINUOUS_BCAST_INT=")) == 0) {
>> + tag_len = strlen("DTM_CONTINUOUS_BCAST_INT=");
>> + config->cont_bcast_int = atoi(&line[tag_len]);
>> + if (config->cont_bcast_int < 0) {
>> + LOG_ER("DTM:cont_bcast_int must be 0 or greater");
>> + fclose(dtm_conf_file);
>> + return -1;
>> + }
>> +
>> + tag = 0;
>> + tag_len = 0;
>> }
>> if (strncmp(line, "DTM_BCAST_FRE_MSECS=",
>> strlen("DTM_BCAST_FRE_MSECS=")) == 0) {
>> diff --git a/src/dtm/dtmnd/dtmd.conf b/src/dtm/dtmnd/dtmd.conf
>> index c93b340..51bebb2 100644
>> --- a/src/dtm/dtmnd/dtmd.conf
>> +++ b/src/dtm/dtmnd/dtmd.conf
>> @@ -38,10 +38,14 @@ DTM_UDP_BCAST_SND_PORT=6800
>> DTM_UDP_BCAST_REV_PORT=6900
>> #
>> -# udp_broadcast_port: The UDP port that the DTMSv listens on
>> -# Mandatory
>> +# bcast_freq: frequency (in milliseconds) between broadcasts during
>> initial
>> +# discovery
>> DTM_BCAST_FRE_MSECS=250
>> +# cont_bcast_int: interval (in seconds) between broadcasts after
>> initial
>> +# discovery has completed
>> +DTM_CONTINUOUS_BCAST_INT=30
>> +
>> # dtm_discovery_timeout_secs: This is the time dtmc should be
>> allowed to wait
>> # for a service script finish executing. If the service is not
>> finished
>> # within this time dtmc will send a timeout message to DTMSv
>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel









