Hi Anders,

I tried to reproduce this quickly, but was not able to.
It looks to me like an issue related to the boot order of the network interfaces and OpenSAF start-up.
I will keep trying to reproduce the problem and update you.

================================================================================
On 8/16/2017 2:02 PM, Anders Widell wrote:
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.104, node_id: 0
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: accept() failed
> 2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: NO Lost contact with 'PL-4'
> 2017-08-15 10:41:21.319 SC-1 kernel:[6471411.217909] IPv6: ADDRCONF(NETDEV_UP): veth1BBPKW: link is not ready
================================================================================

-AVM


On 8/16/2017 2:02 PM, Anders Widell wrote:
Here are the logs from SC-1. It seems to be starting at the same time as PL-4:

2017-08-15 10:41:20.262 SC-1 opensafd: Starting OpenSAF Services(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) (Using TCP)
2017-08-15 10:41:20.263 SC-1 opensafd[113]: /etc/init.d/opensafd: 113: /etc/init.d/opensafd: cannot create /proc/sys/kernel/core_pattern: Read-only file system
2017-08-15 10:41:20.289 SC-1 osafdtmd[156]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
2017-08-15 10:41:20.291 SC-1 osafdtmd[156]: Started
2017-08-15 10:41:20.296 SC-1 osaftransportd[166]: Started
2017-08-15 10:41:20.307 SC-1 osafclmna[170]: Started
2017-08-15 10:41:20.320 SC-1 osafrded[179]: Started
2017-08-15 10:41:20.337 SC-1 osaffmd[188]: Started
2017-08-15 10:41:20.338 SC-1 osaffmd[188]: NO Remote fencing is disabled
2017-08-15 10:41:20.354 SC-1 osafimmd[198]: Started
2017-08-15 10:41:20.370 SC-1 osafimmnd[209]: Started
2017-08-15 10:41:20.371 SC-1 osafimmnd[209]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
2017-08-15 10:41:20.396 SC-1 opensafd[145]: NO Monitoring of TRANSPORT started
2017-08-15 10:41:20.507 SC-1 osafclmna[170]: NO Starting to promote this node to a system controller
2017-08-15 10:41:20.507 SC-1 osafrded[179]: NO Requesting ACTIVE role
2017-08-15 10:41:20.507 SC-1 osafrded[179]: NO RDE role set to Undefined
2017-08-15 10:41:20.542 SC-1 osafdtmd[156]: NO Established contact with 'PL-4'
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.104, node_id: 0
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: ER DTM: accept() failed
2017-08-15 10:41:20.543 SC-1 osafdtmd[156]: NO Lost contact with 'PL-4'
2017-08-15 10:41:21.319 SC-1 kernel:[6471411.217909] IPv6: ADDRCONF(NETDEV_UP): veth1BBPKW: link is not ready
2017-08-15 10:41:22.509 SC-1 osafrded[179]: NO Running '/usr/local/lib/opensaf/opensaf_sc_active' with 0 argument(s)
2017-08-15 10:41:22.534 SC-1 osafrded[179]: NO Switched to ACTIVE from Undefined
2017-08-15 10:41:22.534 SC-1 osaffmd[188]: NO Starting activation supervision: 300000ms
2017-08-15 10:41:22.535 SC-1 osafimmnd[209]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.535 SC-1 osafrded[179]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.535 SC-1 osafclmna[170]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:22.536 SC-1 osaftransportd[166]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: osafdtmd Process down, Rebooting the node, OwnNodeId = 0, SupervisionTime = 60
2017-08-15 10:41:22.546 SC-1 opensaf_reboot: Rebooting local node; timeout=60


On 08/16/2017 10:22 AM, Anders Widell wrote:
It is quite easy to reproduce the "dtm_node_add failed" log message, which leads to a reboot. I have only seen the core dump once, though.

Here is the log from the time when I got the core dump:

2017-08-15 10:41:15.218 PL-4 systemd[1]: Started LSB: ping controllers.
2017-08-15 10:41:15.219 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/opensafd.service: Operation not permitted
2017-08-15 10:41:15.219 PL-4 systemd[1]: Starting OpenSAF daemon...
2017-08-15 10:41:15.219 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/plsync.service: Operation not permitted
2017-08-15 10:41:15.253 PL-4 opensafd: Starting OpenSAF Services(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) (Using TCP)
2017-08-15 10:41:15.254 PL-4 opensafd[134]: /etc/init.d/opensafd: 113: /etc/init.d/opensafd: cannot create /proc/sys/kernel/core_pattern: Read-only file system
2017-08-15 10:41:15.279 PL-4 osafdtmd[169]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
2017-08-15 10:41:15.280 PL-4 osafdtmd[169]: Started
2017-08-15 10:41:15.284 PL-4 osaftransportd[177]: Started
2017-08-15 10:41:15.297 PL-4 osafclmna[183]: Started
2017-08-15 10:41:15.312 PL-4 osafimmnd[193]: Started
2017-08-15 10:41:15.314 PL-4 osafimmnd[193]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
2017-08-15 10:41:15.383 PL-4 opensafd[158]: NO Monitoring of TRANSPORT started
2017-08-15 10:41:15.384 PL-4 opensafd[158]: NO Monitoring of CLMNA started
2017-08-15 10:41:15.498 PL-4 osafclmna[183]: NO Starting to promote this node to a system controller
2017-08-15 10:41:15.532 PL-4 osafdtmd[169]: NO Established contact with 'PL-3'
2017-08-15 10:41:15.533 PL-4 osafdtmd[169]: NO Established contact with 'SC-2'
2017-08-15 10:41:15.534 PL-4 osafdtmd[169]: NO Established contact with 'SC-1'
2017-08-15 10:41:15.535 PL-4 osafimmnd[193]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
2017-08-15 10:41:15.535 PL-4 osafclmna[183]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
2017-08-15 10:41:15.537 PL-4 osafimmnd[193]: NO Fevs count adjusted to 2356 preLoadPid: 0
2017-08-15 10:41:15.616 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2017-08-15 10:41:15.718 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2017-08-15 10:41:15.819 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2017-08-15 10:41:15.819 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_ISOLATED
2017-08-15 10:41:16.252 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
2017-08-15 10:41:16.323 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
2017-08-15 10:41:16.593 PL-4 osafimmnd[193]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2714
2017-08-15 10:41:16.593 PL-4 osafimmnd[193]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE
2017-08-15 10:41:16.594 PL-4 osafimmnd[193]: WA IMM Access Control mode is DISABLED!
2017-08-15 10:41:16.594 PL-4 osafimmnd[193]: NO Epoch set to 6 in ImmModel
2017-08-15 10:41:16.656 PL-4 osafimmnd[193]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM_SERVER_READY
2017-08-15 10:41:16.656 PL-4 opensafd[158]: NO Monitoring of IMMND started
2017-08-15 10:41:16.656 PL-4 osafimmnd[193]: NO ImmModel received scAbsenceAllowed 0
2017-08-15 10:41:16.672 PL-4 osafamfnd[204]: Started
2017-08-15 10:41:16.677 PL-4 osafamfnd[204]: NO Sending node up due to NCSMDS_UP
2017-08-15 10:41:16.682 PL-4 osafamfnd[204]: NO saAmfCtDefQuiescingCompleteTimeout for 'safVersion=5.5.5,safCompType=AmfDemo2' initialized with saAmfCtDefCallbackTimeout
2017-08-15 10:41:16.698 PL-4 osafamfnd[204]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State UNINSTANTIATED => INSTANTIATING
2017-08-15 10:41:16.734 PL-4 osafamfwd[215]: Started
2017-08-15 10:41:16.776 PL-4 osafckptnd[225]: Started
2017-08-15 10:41:16.847 PL-4 osafsmfnd[249]: Started
2017-08-15 10:41:16.885 PL-4 osafamfnd[204]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State INSTANTIATING => INSTANTIATED
2017-08-15 10:41:16.894 PL-4 osafamfnd[204]: NO Assigning 'safSi=NoRed4,safApp=OpenSAF' ACTIVE to 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
2017-08-15 10:41:16.894 PL-4 osafamfnd[204]: NO Assigned 'safSi=NoRed4,safApp=OpenSAF' ACTIVE to 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
2017-08-15 10:41:16.908 PL-4 opensafd: OpenSAF(5.17.10 - 5302e11d528a33cce150a1583b829452b96b440d) services successfully started
2017-08-15 10:41:16.912 PL-4 opensafd[134]: Starting OpenSAF Services (Using TCP): *
2017-08-15 10:41:16.914 PL-4 systemd[1]: Started OpenSAF daemon.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Reached target Multi-User System.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Reached target Graphical Interface.
2017-08-15 10:41:16.996 PL-4 systemd[1]: Failed to reset devices.list on /system.slice/systemd-update-utmp-runlevel.service: Operation not permitted
2017-08-15 10:41:17.011 PL-4 systemd[1]: Starting Update UTMP about System Runlevel Changes...
2017-08-15 10:41:17.017 PL-4 systemd[1]: Started Update UTMP about System Runlevel Changes.
2017-08-15 10:41:17.017 PL-4 systemd[1]: Startup finished in 6.981s.
2017-08-15 10:41:17.959 PL-4 osafimmnd[193]: NO Implementer disconnected 23 <0, 2010f> (@OpenSafImmReplicatorB)
2017-08-15 10:41:17.964 PL-4 osafimmnd[193]: NO Global discard node received for nodeId:2010f pid:209
2017-08-15 10:41:17.964 PL-4 osafimmnd[193]: NO Implementer disconnected 24 <0, 2010f(down)> (@safAmfService2010f)
2017-08-15 10:41:19.208 PL-4 osafdtmd[169]: NO Lost contact with 'SC-1'
2017-08-15 10:41:20.542 PL-4 osafdtmd[169]: ER DTM: dtm_node_add failed .node_ip: 10.0.3.101, node_id: 131343
2017-08-15 10:41:20.548 PL-4 osafckptnd[225]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.549 PL-4 osafsmfnd[249]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.550 PL-4 osafamfnd[204]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.551 PL-4 osafamfwd[215]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.552 PL-4 osafimmnd[193]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.552 PL-4 osafclmna[183]: MDTM:SOCKET recd_bytes :0, conn lost with dh server
2017-08-15 10:41:20.553 PL-4 osaftransportd[177]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: osafdtmd Process down, Rebooting the node, OwnNodeId = 0, SupervisionTime = 60
2017-08-15 10:41:20.570 PL-4 opensaf_reboot: Rebooting local node; timeout=60


On 08/16/2017 10:11 AM, A V Mahesh wrote:
Hi Anders,

Can you share the logs?

-AVM


On 8/16/2017 1:33 PM, Anders Widell wrote:
The problem seems to happen with a low probability when I reboot a node. Presumably, it can happen when two nodes try to establish TCP connections towards each other at approximately the same time.

regards,

Anders Widell


On 08/16/2017 06:21 AM, A V Mahesh wrote:
Hi Anders ,

On 8/15/2017 6:13 PM, Anders Widell wrote:
Is it a pure performance optimization, or is there some potential problem that can happen if two nodes discover each other at the same time and establish TCP connections in both directions? I am seeing error log messages that dtm_node_add() has failed.
It is both. When all nodes join the cluster simultaneously, the node with the higher node ID is designated to establish the TCP connection (we are using a single connection for both tx and rx); otherwise there would be a race.

In which case are you getting the error?

-AVM

On 8/15/2017 6:13 PM, Anders Widell wrote:
A question to Mahesh: what is the reason behind the following few lines of code:

    if (initial_discovery_phase == true) {
        if (node.node_id < dtms_cb->node_id) {
            TRACE(
"DTM: received node_id is less than local node_id dropping message cluster_id: %d node_id: %u",
                node.cluster_id, node.node_id);
            return sock_desc;
        }
    }

Is it a pure performance optimization, or is there some potential problem that can happen if two nodes discover each other at the same time and establish TCP connections in both directions? I am seeing error log messages that dtm_node_add() has failed.

regards,
Anders Widell

On 08/15/2017 11:09 AM, Anders Widell wrote:
Yes I agree, I think it is more flexible to have a new configuration option for this.

Some comments on the patch:

1) The new configuration option should specify broadcast interval time in the unit of milliseconds rather than seconds, to make it possible to configure sub-second intervals.

2) I prefer to comment out the new option from the config file. A commented out option means that the user wishes to use the default value from the C code - which can potentially change after upgrading to a new version of OpenSAF. If the user wishes to explicitly set the configuration option then he can uncomment the line.
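As an illustration of that convention (a sketch only, reusing the option name from the patch; the exact unit and default are still under discussion in point 1), the shipped dtmd.conf entry could look like:

```
# cont_bcast_int: interval between broadcasts after initial discovery has
# completed. When commented out, the compiled-in default is used, which may
# change in a future OpenSAF version. Uncomment to pin an explicit value.
#DTM_CONTINUOUS_BCAST_INT=30
```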

3) When testing this patch I got this rather unsettling core dump, caused by "corrupted double-linked list" in malloc(). I will take a closer look at this to see if I can understand what is happening. Could you also take a look? Maybe run osafdtmd with valgrind:

Stack trace of core.1502786480.osafdtmd.169.PL-4
[New LWP 173]
[New LWP 171]
[New LWP 169]
[New LWP 174]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/lib/opensaf/osafdtmd'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fe9ad080428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
[Current thread is 1 (Thread 0x7fe9adaacb00 (LWP 173))]

Thread 4 (Thread 0x7fe9ada8cb00 (LWP 174)):
#0 0x00007fe9ad145b5d in poll () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1 0x000055686a3b53ba in poll (__timeout=20000, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2  dtm_intranode_processing () at src/dtm/dtmnd/dtm_intra.c:713
        poll_ret_val = 0
        j = <optimized out>
        __FUNCTION__ = "dtm_intranode_processing"
#3 0x00007fe9ad41b6ba in start_thread (arg=0x7fe9ada8cb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9ada8cb00
        now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617627904, -4592431484730074264, 1, 140721305493231, 140641617628608, 0, 4582197893944318824, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#4 0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 3 (Thread 0x7fe9adacf740 (LWP 169)):
#0 __clock_nanosleep (clock_id=clock_id@entry=1, flags=flags@entry=1, req=req@entry=0x7ffc3b6d2130, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
        oldstate = 0
        r = <optimized out>
        rem = 0x0
        req = 0x7ffc3b6d2130
        flags = 1
        clock_id = <optimized out>
#1 0x00007fe9ad65e034 in osaf_nanosleep (sleep_duration=sleep_duration@entry=0x7ffc3b6d2170) at src/base/osaf_time.c:44
        wakeup_time = {tv_sec = 6471440, tv_nsec = 210765046}
        retval = <optimized out>
#2 0x00007fe9ad65af67 in ncs_os_task (task=task@entry=0x7ffc3b6d2400, request=request@entry=NCS_OS_TASK_SLEEP) at src/base/os_defs.c:333
        ts = {tv_sec = 30, tv_nsec = 0}
        __PRETTY_FUNCTION__ = "ncs_os_task"
#3 0x00007fe9ad665143 in ncs_task_sleep (delay_in_ms=<optimized out>) at src/base/sysf_tsk.c:141
        task = {info = {create = {i_entry_point = 0x7530, i_name = 0x0, i_priority = 0, i_stack_nbytes = 0, i_policy = 0, i_ep_arg = 0x0, o_handle = 0x0}, release = {i_handle = 0x7530}, detach = {i_handle = 0x7530}, start = {i_handle = 0x7530}, stop = {i_handle = 0x7530}, sleep = {i_delay_in_ms = 30000}, current_handle = {o_handle = 0x7530}}}
#4 0x000055686a3b3c65 in main (argc=<optimized out>, argv=0x7ffc3b6d2668) at src/dtm/dtmnd/dtm_main.c:396
        rc = <optimized out>
send_bcast_buffer = "\000:\000\001\000\002\004\017\000\032,\002\061\060.0.3.104", '\000' <repeats 232 times>
        bcast_buf_len = 58
        dis_time_out_usec = 5000000
        dis_elapsed_time_usec = 5250000
        dtms_cb = 0x55686a5cb0a0 <_dtms_cb>
        __FUNCTION__ = "main"

Thread 2 (Thread 0x7fe9adaccb00 (LWP 171)):
#0 0x00007fe9ad145b5d in poll () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1 0x00007fe9ad65cda0 in poll (__timeout=-1, __nfds=1, __fds=0x7fe9adacc260) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2 osaf_poll_no_timeout (io_fds=0x7fe9adacc260, i_nfds=1) at src/base/osaf_poll.c:31
No locals.
#3 0x00007fe9ad65cfe5 in osaf_ppoll (io_fds=io_fds@entry=0x7fe9adacc260, i_nfds=i_nfds@entry=1, i_timeout_ts=0x0, i_sigmask=i_sigmask@entry=0x0) at src/base/osaf_poll.c:82
        start_time = {tv_sec = 0, tv_nsec = 1}
        time_left_ts = <optimized out>
        result = -1381186976
#4 0x00007fe9ad66416f in ncs_tmr_wait () at src/base/sysf_tmr.c:463
        rc = <optimized out>
        inds_rmvd = <optimized out>
        next_delay = <optimized out>
        tv = <optimized out>
        ts_current = {tv_sec = 6471404, tv_nsec = 949388789}
        ts = {tv_sec = 16777215, tv_nsec = 0}
        set = {fd = 6, events = 1, revents = 0}
#5 0x00007fe9ad41b6ba in start_thread (arg=0x7fe9adaccb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9adaccb00
        now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617890048, -4592431484730074264, 1, 140721305492959, 140641617890752, 0, 4582197928304057192, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6 0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 1 (Thread 0x7fe9adaacb00 (LWP 173)):
#0 0x00007fe9ad080428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pid = 169
        selftid = 173
#1  0x00007fe9ad08202a in __GI_abort () at abort.c:89
        save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x66372d3030306464, sa_sigaction = 0x66372d3030306464}, sa_mask = {__val = {3487303855927015781, 2337418197644357680, 3472328296227680304, 3467824696768081952, 7076108730981878304, 3975887029563253302, 3775202183040165478, 8083248238249914416, 3472328296227680288, 2319406834570502192, 2314885530818453552, 2314885530818453536, 2314885530818453536, 7161132921294692384, 3702915907824803179, 3256155514953152354}}, sa_flags = 1667655223, sa_restorer = 0x65}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007fe9ad0c27ea in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe9ad1db2e0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175 ap = <error reading variable ap (Attempt to dereference a generic pointer.)>
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3 0x00007fe9ad0cb539 in malloc_printerr (ar_ptr=0x7fe9a4000020, ptr=<optimized out>, str=0x7fe9ad1d8095 "corrupted double-linked list", action=3) at malloc.c:5004
        buf = "00007fe9a40012c0"
        cp = <optimized out>
        ar_ptr = 0x7fe9a4000020
        ptr = <optimized out>
        str = 0x7fe9ad1d8095 "corrupted double-linked list"
        action = 3
#4 _int_free (av=0x7fe9a4000020, p=<optimized out>, have_lock=0) at malloc.c:4012
        size = 624
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = 736
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x0
        locked = <optimized out>
#5 0x00007fe9ad0ce98c in __GI___libc_free (mem=mem@entry=0x7fe9a40010c0) at malloc.c:2966
        ar_ptr = <optimized out>
        p = 0x7fe9a40010b0
        hook = <optimized out>
#6 0x000055686a3bcbf2 in dtm_process_connect (dtms_cb=dtms_cb@entry=0x55686a5cb0a0 <_dtms_cb>, node_ip=node_ip@entry=0x7fe9adaabd90 "10.0.3.101", data=data@entry=0x7fe9adaabed0 "", len=len@entry=56) at src/dtm/dtmnd/dtm_node_sockets.c:1436
        foreign_port = 6700
        sock_desc = -1
node = {cluster_id = 1, node_id = 131343, node_name = '\000' <repeats 255 times>, node_ip = "10.0.3.101", '\000' <repeats 35 times>, i_addr_family = DTM_IP_ADDR_TYPE_IPV4, comm_socket = 0, pat_nodeid = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, pat_comm_socket = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, pat_ip_address = {bit = 0, left = 0x0, right = 0x0, key_info = 0x0}, comm_status = 0 '\000', node_lock = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, count = 0, tid = 0}, mbx = 0, mbx_fd = 0, msgs_hdr = 0x0, msgs_tail = 0x0, bytes_tb_read = 0, buff_total_len = 0, len_buff = "\000", num_by_read_for_len_buff = 0 '\000', buffer = 0x0}
        new_node = 0x7fe9a40010c0
        buffer = 0x7fe9adaabedc "10.0.3.101"
        mcast_flag = <optimized out>
        __FUNCTION__ = "dtm_process_connect"
#7 0x000055686a3b859a in node_discovery_process (arg=<optimized out>) at src/dtm/dtmnd/dtm_node.c:648
        new_sd = -1
        fd_check = 1
        __FUNCTION__ = "node_discovery_process"
        poll_ret = <optimized out>
        end_server = 0
        compress_array = <optimized out>
        close_conn = 0
        dtms_cb = 0x55686a5cb0a0 <_dtms_cb>
        current_size = <optimized out>
        i = <optimized out>
        j = <optimized out>
inbuf = "\000:\000\001\000\002\001\017\000\032,\002\061\060.0.3.101", '\000' <repeats 1001 times>
        data1 = 0x7fe9adaabed2 ""
        recd_bytes = <optimized out>
        recd_buf_len = <optimized out>
        node_info_buffer_len = 20
node_info_hrd = "\000\022V\022\064V\001\001\000\002\004\017\000\000\000\004PL-4", '\000' <repeats 250 times>
        node_ip = "10.0.3.101\000\000\000\000\000\000\020e<jhU\000\000\001", '\000' <repeats 20 times>
#8 0x00007fe9ad41b6ba in start_thread (arg=0x7fe9adaacb00) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fe9adaacb00
        now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140641617758976, -4592431484730074264, 1, 140721305493695, 140641617759680, 0, 4582197876764449640, 4582198654531672936}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#9 0x00007fe9ad15182d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

regards,

Anders Widell


On 08/14/2017 03:21 PM, Alex Jones wrote:

My feeling is that we should keep the two phases separate. If our new, scalable way doesn't make it in for a little while, at least we have this, and it is configurable.

Alex


On 08/14/2017 07:11 AM, Anders Widell wrote:
------------------------------------------------------------------------
NOTICE: This email was received from an EXTERNAL sender
------------------------------------------------------------------------

The idea is to still have an "initial discovery phase" where we broadcast messages at a high rate, and then continue broadcasting at a lower rate after the initial discovery phase. If we have only one configuration option and enable this new continuous broadcast mode by setting that option to zero as you suggest, we would need to hard-code the length of the initial phase (to five seconds?) as well as the interval between broadcasts (30 seconds?) after the initial phase. Since we anyhow intend to implement a more sophisticated solution where only one of the nodes in the cluster continues to send broadcast messages, I think it could be OK to hard-code. What do you think, Alex?

regards,

Anders Widell


On 08/14/2017 07:56 AM, A V Mahesh wrote:
> Hi Alex Jones,
>
> I don't think you need an additional configuration option
> (CONT_BCAST_INT) for continuing dtm discovery;
> you can reuse the existing DTM_INI_DIS_TIMEOUT_SECS=5
> (dtm_discovery_timeout_secs) itself, like this:
> Just add an additional comment saying:
>
> "# Comment the next line (DTM_INI_DIS_TIMEOUT_SECS) to enable
> continuing dtm discovery"
>
> ==========================================================================================
>
> # dtm_discovery_timeout_secs: This is the time dtmc should be allowed
> to wait
> # for a service script finish executing. If the service is not finished
> # within this time dtmc will send a timeout message to DTMSv
> # Mandatory
> DTM_INI_DIS_TIMEOUT_SECS=5
> ==========================================================================================
>
>
> And when reading the configuration, if DTM_INI_DIS_TIMEOUT_SECS is ZERO,
> change the logic to continue dtm discovery.
>
> -AVM
>
> On 8/11/2017 8:51 PM, Alex Jones wrote:
>> If TCP closes a socket (due to TCP_USER_TIMEOUT or some other issue)
>> because
>> of a network problem, and DTM declares that the node has exited the
>> cluster,
>> the node never attempts to come back into the cluster if it does not
>> reboot.
>>
>> Initial discovery broadcasts stop after 5 seconds.
>>
>> Discovery broadcasts should continue after the initial discovery
>> phase at a
>> reduced rate.
>> ---
>> src/dtm/dtmnd/dtm_cb.h | 1 +
>> src/dtm/dtmnd/dtm_main.c | 56
>> +++++++++++++++++++++++++----------------
>> src/dtm/dtmnd/dtm_read_config.c | 17 +++++++++++++
>> src/dtm/dtmnd/dtmd.conf | 8 ++++--
>> 4 files changed, 58 insertions(+), 24 deletions(-)
>>
>> diff --git a/src/dtm/dtmnd/dtm_cb.h b/src/dtm/dtmnd/dtm_cb.h
>> index f7c48c4..c5b5890 100644
>> --- a/src/dtm/dtmnd/dtm_cb.h
>> +++ b/src/dtm/dtmnd/dtm_cb.h
>> @@ -80,6 +80,7 @@ typedef struct dtm_internode_cb {
>> DTM_IP_ADDR_TYPE i_addr_family; /* Indicates V4 or V6 */
>> uint8_t mcast_flag; /* Indicates mcast */
>> int32_t initial_dis_timeout;
>> + int32_t cont_bcast_int;
>> int64_t bcast_msg_freq;
>> NCS_PATRICIA_TREE nodeid_tree; /* NODE_DB information of Nodes */
>> NCS_PATRICIA_TREE comm_sock_tree; /* NODE_DB information of Nodes */
>> diff --git a/src/dtm/dtmnd/dtm_main.c b/src/dtm/dtmnd/dtm_main.c
>> index 724ce93..9f6d7cc 100644
>> --- a/src/dtm/dtmnd/dtm_main.c
>> +++ b/src/dtm/dtmnd/dtm_main.c
>> @@ -233,6 +233,30 @@ err:
>> return rc;
>> }
>> +static uint32_t dtm_send_bcast_mcast(DTM_INTERNODE_CB *dtms_cb,
>> + void *send_bcast_buffer,
>> + size_t bcast_buf_len)
>> +{
>> + uint32_t rc;
>> +
>> + TRACE_ENTER();
>> +
>> + if (dtms_cb->mcast_flag == true) {
>> + rc = dtm_dgram_sendto_mcast(dtms_cb, send_bcast_buffer,
>> bcast_buf_len);
>> + if (NCSCC_RC_SUCCESS != rc) {
>> + LOG_ER("DTM: dtm_dgram_sendto_mcast Failed rc : %d \n",
>> rc);
>> + }
>> + } else {
>> + rc = dtm_dgram_sendto_bcast(dtms_cb, send_bcast_buffer,
>> bcast_buf_len);
>> + if (NCSCC_RC_SUCCESS != rc) {
>> + LOG_ER("DTM: dtm_dgram_sendto_bcast Failed rc : %d \n",
>> rc);
>> + }
>> + }
>> +
>> + TRACE_LEAVE();
>> + return rc;
>> +}
>> +
>> /**
>> * DTM process main function
>> *
>> @@ -357,26 +381,7 @@ int main(int argc, char *argv[])
>> /* Broadcast msg string in datagram to clients every 250 m
>> * seconds */
>> - if (dtms_cb->mcast_flag == true) {
>> -
>> - rc = dtm_dgram_sendto_mcast(dtms_cb, send_bcast_buffer,
>> - bcast_buf_len);
>> - if (NCSCC_RC_SUCCESS != rc) {
>> - LOG_ER(
>> - "DTM: dtm_dgram_sendto_mcast Failed rc : %d \n",
>> - rc);
>> - }
>> -
>> - } else {
>> -
>> - rc = dtm_dgram_sendto_bcast(dtms_cb, send_bcast_buffer,
>> - bcast_buf_len);
>> - if (NCSCC_RC_SUCCESS != rc) {
>> - LOG_ER(
>> - "DTM: dtm_dgram_sendto_bcast Failed rc : %d \n",
>> - rc);
>> - }
>> - }
>> + dtm_send_bcast_mcast(dtms_cb, send_bcast_buffer,
>> bcast_buf_len);
>> dis_elapsed_time_usec =
>> dis_elapsed_time_usec + (dtms_cb->bcast_msg_freq * 1000);
>> @@ -387,8 +392,15 @@ int main(int argc, char *argv[])
>> /*************************************************************/
>> initial_discovery_phase = false;
>> while (1) {
>> - m_NCS_TASK_SLEEP(0xfffffff0);
>> - /* m_NCS_TASK_SLEEP(30000); */
>> + if (dtms_cb->cont_bcast_int) {
>> + m_NCS_TASK_SLEEP(dtms_cb->cont_bcast_int * 1000);
>> + /* periodically send a broadcast */
>> + dtm_send_bcast_mcast(dtms_cb,
>> + send_bcast_buffer,
>> + bcast_buf_len);
>> + } else {
>> + m_NCS_TASK_SLEEP(0xfffffff0);
>> + }
>> }
>> done1:
>> LOG_ER("DTM : dtm_destroy_service_discovery_task exiting...");
>> diff --git a/src/dtm/dtmnd/dtm_read_config.c
>> b/src/dtm/dtmnd/dtm_read_config.c
>> index 85f34f4..0db21e8 100644
>> --- a/src/dtm/dtmnd/dtm_read_config.c
>> +++ b/src/dtm/dtmnd/dtm_read_config.c
>> @@ -39,6 +39,7 @@ extern uint32_t intranode_max_processes;
>> #define KEEPALIVE_PROBES 9
>> #define DIS_TIME_OUT 5
>> #define BCAST_FRE 250
>> +#define CONT_BCAST_INT 30
>> #define USER_TIMEOUT 1500 // 1.5 sec to match other transport
>> const char *IN6ADDR_LINK_LOCAL =
>> @@ -108,6 +109,8 @@ void dtm_print_config(DTM_INTERNODE_CB *config)
>> TRACE(" %d", config->initial_dis_timeout);
>> TRACE(" DTM_BCAST_FRE_MSECS: ");
>> TRACE(" %" PRId64 "", config->bcast_msg_freq);
>> + TRACE(" DTM_CONT_BCAST_INT: ");
>> + TRACE(" %d", config->cont_bcast_int);
>> TRACE(" DTM_SOCK_SND_BUF_SIZE: ");
>> TRACE(" %d", config->sock_sndbuf_size);
>> TRACE(" DTM_SOCK_RCV_BUF_SIZE: ");
>> @@ -277,6 +280,7 @@ int dtm_read_config(DTM_INTERNODE_CB *config,
>> char *dtm_config_file)
>> config->comm_keepidle_time = USER_TIMEOUT;
>> config->i_addr_family = DTM_IP_ADDR_TYPE_IPV4;
>> config->bcast_msg_freq = BCAST_FRE;
>> + config->cont_bcast_int = CONT_BCAST_INT;
>> config->initial_dis_timeout = DIS_TIME_OUT;
>> config->sock_sndbuf_size = 0;
>> config->sock_rcvbuf_size = 0;
>> @@ -431,6 +435,19 @@ int dtm_read_config(DTM_INTERNODE_CB *config,
>> char *dtm_config_file)
>> tag = 0;
>> tag_len = 0;
>> +
>> + }
>> + if (strncmp(line, "DTM_CONTINUOUS_BCAST_INT=",
>> strlen("DTM_CONTINUOUS_BCAST_INT=")) == 0) {
>> + tag_len = strlen("DTM_CONTINUOUS_BCAST_INT=");
>> + config->cont_bcast_int = atoi(&line[tag_len]);
>> + if (config->cont_bcast_int < 0) {
>> + LOG_ER("DTM:cont_bcast_int must be 0 or greater");
>> + fclose(dtm_conf_file);
>> + return -1;
>> + }
>> +
>> + tag = 0;
>> + tag_len = 0;
>> }
>> if (strncmp(line, "DTM_BCAST_FRE_MSECS=",
>> strlen("DTM_BCAST_FRE_MSECS=")) == 0) {
>> diff --git a/src/dtm/dtmnd/dtmd.conf b/src/dtm/dtmnd/dtmd.conf
>> index c93b340..51bebb2 100644
>> --- a/src/dtm/dtmnd/dtmd.conf
>> +++ b/src/dtm/dtmnd/dtmd.conf
>> @@ -38,10 +38,14 @@ DTM_UDP_BCAST_SND_PORT=6800
>> DTM_UDP_BCAST_REV_PORT=6900
>> #
>> -# udp_broadcast_port: The UDP port that the DTMSv listens on
>> -# Mandatory
>> +# bcast_freq: frequency (in milliseconds) between broadcasts during
>> initial
>> +# discovery
>> DTM_BCAST_FRE_MSECS=250
>> +# cont_bcast_int: interval (in seconds) between broadcasts after
>> initial
>> +# discovery has completed
>> +DTM_CONTINUOUS_BCAST_INT=30
>> +
>> # dtm_discovery_timeout_secs: This is the time dtmc should be
>> allowed to wait
>> # for a service script finish executing. If the service is not
>> finished
>> # within this time dtmc will send a timeout message to DTMSv
>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel









