Hi Mahesh,
I tested OpenSAF 4.5 with TIPC as the MDS transport, and this issue is not seen.

I have raised a ticket: “*#1285 MDS TCP: zero bytes recvd results in application exit*”

Regards,
Girish

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Monday, February 23, 2015 10:04 AM
*To:* Girish Nagaraj; [email protected]
*Subject:* Re: [users] Issues with CPSv

Hi,

To confirm and isolate the problem further, test your application with the TIPC transport with the ticket #1227 fix (on both 4.3 and 4.5) and provide your observations.

If the issue is NOT reproducible with the TIPC transport, then as a workaround prevent your ckpt application from sending ZERO-size (hack) messages over the TCP transport, and raise a ticket with all the details, as Mathi explained.

-AVM

On 2/20/2015 3:33 PM, Girish Nagaraj wrote:

Hi,

Yes, there is a similar issue with TCP as well. The application exits with this message:

Feb 20 15:24:59 fedvm1 RIB[28549]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err :Success
Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart probation timer started (timeout: 4000000000 ns)
Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO Restarting a component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safComp=ribd,safSu=SU1,safSg=zebos-simplex,safApp=zebos' faulted due to 'avaDown' : Recovery is 'componentRestart'

I experimented with the following code changes:

    recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->len_buff, 2, MSG_NOSIGNAL);
    if (0 == recd_bytes) {
        syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 111:%d", recd_bytes, errno);
        close(tcp_cb->DBSRsock);
        exit(0);
    } else if (2 == recd_bytes) {
        uint16_t local_len_buf = 0;

        data = tcp_cb->len_buff;
        local_len_buf = ncs_decode_16bit(&data);

        /* MY CHANGE START */
        if (0 == local_len_buf)
            return;
        /* MY CHANGE END */

        tcp_cb->buff_total_len = local_len_buf;
        tcp_cb->num_by_read_for_len_buff = 2;

        if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
            /* Length + 2 is done to reuse the same buffer while sending to other nodes */
            syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
            return;
        }
        recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
        if (recd_bytes < 0) {
            return;
        } else if (0 == recd_bytes) {
            syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 222:%d len:%d", recd_bytes, errno, local_len_buf);
            close(tcp_cb->DBSRsock);
            exit(0);

This caused many other issues, so I think just returning won’t work.

Regards,
Girish

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, February 20, 2015 1:38 PM
To: Girish Nagaraj; [email protected]
Subject: Re: [users] Issues with CPSv

Hi,

On 2/20/2015 1:19 PM, Girish Nagaraj wrote:
> Hi,
>
> I think this is not a connection loss. We are passing 0 (the number of
> bytes to be read) to the recv() function, which returns 0 received bytes.

You mean you are seeing an issue similar to `TIPC ticket #1227 mds/tipc : protect mds application form zero bytes hacking messages` for TCP as well?

-AVM

>
> local_len_buf = ncs_decode_16bit(&data);
>
> Is there a mistake in decoding local_len_buf?
>
> Regards,
> Girish
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, February 20, 2015 11:03 AM
> To: [email protected]
> Subject: Re: [users] Issues with CPSv
>
> Hi,
>
> On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
>> local_len_buf turns out to be 0, this causes recv() to return 0 and
>> the application exits. Is this a programming bug??
> This is expected behavior: if a connection loss happens on a TCP
> socket, recv() receives ZERO bytes. This is not related to CPSv.
>
> -AVM
>
>
> On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
>> Hi,
>>
>> *Background*:
>>
>> Opensaf version: 4.5
>> Number of checkpoints used: 2
>>
>> In our application we use CPSv to save application data. When the
>> application faults, it is restarted and its state is restored by
>> reading data from the checkpoints.
>>
>> Model: Simplex
>>
>> *Issue faced:*
>>
>> The application sometimes crashes, with the stack trace below:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
>> 94 patricia.c: No such file or directory.
>> (gdb) bt
>> #0 search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
>> #1 0xb76d0bef in ncs_patricia_tree_get (pTree=pTree@entry=0x8f733e4, pKey=pKey@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:434
>> #2 0xb7738493 in cpa_lcl_ckpt_node_get (lcl_ckpt_tree=lcl_ckpt_tree@entry=0x8f733e4, lc_hdl=lc_hdl@entry=0xbfa0cdf8, lc_node=lc_node@entry=0xbfa0ce10) at cpa_db.c:195
>> #3 0xb7734d76 in saCkptCheckpointWrite (checkpointHandle=150466120, ioVector=0x92c6d28, numberOfElements=1320, erroneousVectorIndex=erroneousVectorIndex@entry=0xbfa0d35c) at cpa_api.c:3134
>>
>> (gdb) p pNode
>> $2 = (NCS_PATRICIA_NODE *) 0x5e
>> (gdb) p *pTree
>> $4 = {root_node = {bit = -1, left = 0x8f7e9c0, right = 0x8f733e4, key_info = 0x8f734b8 ""}, params = {key_size = 8, info_size = 0, actual_key_size = 0, node_size = 0}, n_nodes = 3}
>>
>> Sometimes the application exits with the message below:
>>
>> Feb 19 15:13:31 controller2 RIB[28395]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err:0 len:0
>> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart probation timer started (timeout: 4000000000 ns)
>> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO Restarting a component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
>>
>> Below is the modified code snippet from file
>> osaf/libs/core/mds/mds_dt_trans.c
>>
>>     } else if (2 == recd_bytes) {
>>         uint16_t local_len_buf = 0;
>>
>>         data = tcp_cb->len_buff;
>>         local_len_buf = ncs_decode_16bit(&data);
>>         tcp_cb->buff_total_len = local_len_buf;
>>         tcp_cb->num_by_read_for_len_buff = 2;
>>
>>         if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
>>             /* Length + 2 is done to reuse the same buffer
>>                while sending to other nodes */
>>             syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
>>             return;
>>         }
>>         recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
>>         if (recd_bytes < 0) {
>>             return;
>>         } else if (0 == recd_bytes) {
>>             syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err:%d len:%d", recd_bytes, errno, local_len_buf);
>>             close(tcp_cb->DBSRsock);
>>             exit(0); *<<<<<<<EXITS HERE>>>>>>>>>>*
>>         } else if (local_len_buf > recd_bytes) {
>>             /* can happen only in two cases, system call interrupt or half data, */
>>             TRACE("less data recd, recd bytes = %d, actual len = %d", recd_bytes, local_len_buf);
>>             tcp_cb->bytes_tb_read = tcp_cb->buff_total_len - recd_bytes;
>>             return;
>>
>> local_len_buf turns out to be 0, this causes recv() to return 0 and
>> the application exits. Is this a programming bug??
>>
>> Could someone please help to resolve these issues.
>>
>> Regards,
>> Girish
>>

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
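
The ambiguity discussed in this thread can be shown outside OpenSAF. On a stream socket, recv() returns 0 both when the peer has closed the connection and when the caller asks for 0 bytes, so a decoded length of 0 has to be handled before the second recv() call. The following is only a minimal sketch, not the OpenSAF fix for ticket #1285. It assumes a blocking socket and a 2-byte big-endian length prefix (the thread only shows the length being decoded with ncs_decode_16bit()). The names read_exact(), read_frame(), enum frame_result and demo_zero_length_frame() are hypothetical and exist only for this illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch (not OpenSAF names). */
enum frame_result { FRAME_OK, FRAME_EMPTY, FRAME_PEER_CLOSED, FRAME_ERROR };

/* Read exactly len bytes from a blocking socket.
 * Returns 1 on success, 0 if the peer closed, -1 on error. */
static int read_exact(int fd, uint8_t *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n == 0)
            return 0;      /* len - got > 0 here, so 0 really means "closed" */
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal: retry */
            return -1;
        }
        got += (size_t)n;
    }
    return 1;
}

/* Read one frame: 2-byte big-endian length (assumed), then the payload.
 * On FRAME_OK the caller owns *payload and must free() it. */
static enum frame_result read_frame(int fd, uint8_t **payload, uint16_t *len)
{
    uint8_t hdr[2];
    int rc;

    *payload = NULL;
    *len = 0;

    rc = read_exact(fd, hdr, sizeof(hdr));
    if (rc <= 0)
        return rc == 0 ? FRAME_PEER_CLOSED : FRAME_ERROR;

    *len = (uint16_t)((hdr[0] << 8) | hdr[1]);
    if (*len == 0)
        return FRAME_EMPTY;  /* never call recv() with a length of 0 */

    *payload = malloc(*len);
    if (*payload == NULL)
        return FRAME_ERROR;

    rc = read_exact(fd, *payload, *len);
    if (rc <= 0) {
        free(*payload);
        *payload = NULL;
        return rc == 0 ? FRAME_PEER_CLOSED : FRAME_ERROR;
    }
    return FRAME_OK;
}

/* Demo over a local socket pair: a zero-length frame, a 3-byte frame,
 * then a close. Returns 0 if each case is classified as expected. */
int demo_zero_length_frame(void)
{
    const uint8_t wire[] = { 0, 0, 0, 3, 'a', 'b', 'c' };
    uint8_t *payload = NULL;
    uint16_t len = 0;
    int sv[2];
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send(sv[0], wire, sizeof(wire), 0) != (ssize_t)sizeof(wire)) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    close(sv[0]);  /* the peer goes away after sending both frames */

    ok = read_frame(sv[1], &payload, &len) == FRAME_EMPTY;
    ok = ok && read_frame(sv[1], &payload, &len) == FRAME_OK && len == 3;
    free(payload);
    ok = ok && read_frame(sv[1], &payload, &len) == FRAME_PEER_CLOSED;
    close(sv[1]);
    return ok ? 0 : -1;
}
```

Calling demo_zero_length_frame() from a test program is enough to exercise all three cases. Note that the sketch reports the empty frame to the caller instead of silently returning. The experiment described earlier in the thread found that a bare return caused other issues, so what the caller should do with an empty frame is left open here.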
