Girish,

It appears that the crash is happening in the CPA (CheckpointAgent) library linked to your application. To enable the CPA traces, export CPA_TRACE_PATHNAME=/tmp/myapp_cpadebug.log before starting your application, reproduce the crash, and share the resulting trace file.
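As a rough sketch of that step (the launch command `./myapp` is only a placeholder and is not named anywhere in this thread), the variable needs to be exported in the same shell that starts the application, so that the process inherits it:

```shell
# Export the variable in the same shell that will start the application,
# so that the application process inherits it.
export CPA_TRACE_PATHNAME=/tmp/myapp_cpadebug.log

# Sanity check: a child process should see the same value.
sh -c 'echo "CPA_TRACE_PATHNAME=$CPA_TRACE_PATHNAME"'

# Start the application as usual (placeholder; substitute your own command):
# ./myapp

# After reproducing the crash, this is the file to attach to the ticket:
# ls -l "$CPA_TRACE_PATHNAME"
```

The launch and `ls` lines are left commented out because the real start-up command depends on your setup.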
Cheers,
Mathi.

----- [email protected] wrote:

> Hi Mathi/Mahesh,
>
> First of all thanks for helping me in resolving this issue.
>
> Do you require CPA(application) or traces of CPA? If it is traces, please
> let me know how to get it.
>
> Regards,
> Girish
>
> -----Original Message-----
> From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
> Sent: Friday, February 20, 2015 3:55 PM
> To: [email protected]
> Cc: [email protected]; [email protected]
> Subject: Re: [users] Issues with CPSv
>
> Hi,
>
> Please raise a ticket for this crash and share the traces of CPND and
> CPA(your application).
> Also, you should specify a testcase or try to explain what the application
> is doing and at what point the crash is occuring?
>
> Thanks,
> Mathi.
>
> ----- [email protected] wrote:
>
> > Hi,
> >
> > I don’t get this issue with opensaf version 4.3, but I get segfault:
> >
> > application sometimes crashes, stack trace as below:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > 94 patricia.c: No such file or directory.
> > (gdb) bt
> > #0  search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > #1  0xb76d0bef in ncs_patricia_tree_get (pTree=pTree@entry=0x8f733e4, pKey=pKey@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:434
> > #2  0xb7738493 in cpa_lcl_ckpt_node_get (lcl_ckpt_tree=lcl_ckpt_tree@entry=0x8f733e4, lc_hdl=lc_hdl@entry=0xbfa0cdf8, lc_node=lc_node@entry=0xbfa0ce10) at cpa_db.c:195
> > #3  0xb7734d76 in saCkptCheckpointWrite (checkpointHandle=150466120, ioVector=0x92c6d28, numberOfElements=1320, erroneousVectorIndex=erroneousVectorIndex@entry=0xbfa0d35c) at cpa_api.c:3134
> >
> > (gdb) p pNode
> > $2 = (NCS_PATRICIA_NODE *) 0x5e
> > (gdb) p *pTree
> > $4 = {root_node = {bit = -1, left = 0x8f7e9c0, right = 0x8f733e4, key_info = 0x8f734b8 ""}, params = {key_size = 8, info_size = 0, actual_key_size = 0, node_size = 0}, n_nodes = 3}
> >
> > Regards,
> > Girish
> >
> > *From:* Girish Nagaraj [mailto:[email protected]]
> > *Sent:* Friday, February 20, 2015 3:34 PM
> > *To:* 'A V Mahesh'; '[email protected]'
> > *Subject:* RE: [users] Issues with CPSv
> >
> > Hi,
> >
> > Yes, similar issue in TCP also: exits with message:
> >
> > Feb 20 15:24:59 fedvm1 RIB[28549]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err :Success
> > Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart probation timer started (timeout: 4000000000 ns)
> > Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO Restarting a component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
> > Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safComp=ribd,safSu=SU1,safSg=zebos-simplex,safApp=zebos' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >
> > I experimented with code changes:
> >
> >     recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->len_buff, 2, MSG_NOSIGNAL);
> >     if (0 == recd_bytes) {
> >         syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 111:%d", recd_bytes, errno);
> >         close(tcp_cb->DBSRsock);
> >         exit(0);
> >     } else if (2 == recd_bytes) {
> >         uint16_t local_len_buf = 0;
> >
> >         data = tcp_cb->len_buff;
> >         local_len_buf = ncs_decode_16bit(&data);
> >
> >         /* MY CHANGE START */
> >         *if (0 == local_len_buf)*
> >         *    return;*
> >         /* MY CHANGE END */
> >
> >         tcp_cb->buff_total_len = local_len_buf;
> >         tcp_cb->num_by_read_for_len_buff = 2;
> >
> >         if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
> >             /* Length + 2 is done to reuse the same buffer
> >                while sending to other nodes */
> >             syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
> >             return;
> >         }
> >         recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
> >         if (recd_bytes < 0) {
> >             return;
> >         } else if (0 == recd_bytes) {
> >             syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 222:%d len:%d", recd_bytes, errno, local_len_buf);
> >             close(tcp_cb->DBSRsock);
> >             exit(0);
> >
> > This caused many other issues, so I think just returning won’t work.
> >
> > Regards,
> > Girish
> >
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected] <[email protected]>]
> > Sent: Friday, February 20, 2015 1:38 PM
> > To: Girish Nagaraj; [email protected]
> > Subject: Re: [users] Issues with CPSv
> >
> > Hi,
> >
> > On 2/20/2015 1:19 PM, Girish Nagaraj wrote:
> > > Hi ,
> > >
> > > I think this is not connection loss, we are passing 0 (len of bytes
> > > to be read) to recv() function. Which returns back 0 received bytes.
> >
> > You mean, you are seeing issue similar to `TIPC ticket #1227 mds/tipc
> > : protect mds application form zero bytes hacking messages` for TCP as
> > well ?
> >
> > -AVM
> >
> > > local_len_buf = ncs_decode_16bit(&data);
> > >
> > > Is there mistake in decoding local_len_buf?
> > >
> > > Regards,
> > > Girish
> > >
> > > -----Original Message-----
> > > From: A V Mahesh [mailto:[email protected] <[email protected]>]
> > > Sent: Friday, February 20, 2015 11:03 AM
> > > To: [email protected]
> > > Subject: Re: [users] Issues with CPSv
> > >
> > > Hi,
> > >
> > > On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
> > >> local_len_buf turns out be 0, this causes recv() to return 0 and
> > >> application exits. Is this programming bug??
> > > This is expected behavior , if any connection loss happens on TCP
> > > socket will recives ZERO size bytes, this not related to CPSv.
> > >
> > > -AVM
> > >
> > > On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
> > >> Hi,
> > >>
> > >> *Background*:
> > >>
> > >> Opensaf version: 4.5
> > >> Number of checkpoints used: 2
> > >>
> > >> In our application we use CPSv to save application data and when
> > >> application faults, it is restarted and it’s state is restored back
> > >> by reading data from checkpoints
> > >>
> > >> Model: Simplex
> > >>
> > >> *Issue faced:*
> > >>
> > >> application sometimes crashes, stack trace as below:
> > >>
> > >> Program received signal SIGSEGV, Segmentation fault.
> > >> search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > >> 94 patricia.c: No such file or directory.
> > >> (gdb) bt
> > >> #0  search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > >> #1  0xb76d0bef in ncs_patricia_tree_get (pTree=pTree@entry=0x8f733e4, pKey=pKey@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:434
> > >> #2  0xb7738493 in cpa_lcl_ckpt_node_get (lcl_ckpt_tree=lcl_ckpt_tree@entry=0x8f733e4, lc_hdl=lc_hdl@entry=0xbfa0cdf8, lc_node=lc_node@entry=0xbfa0ce10) at cpa_db.c:195
> > >> #3  0xb7734d76 in saCkptCheckpointWrite (checkpointHandle=150466120, ioVector=0x92c6d28, numberOfElements=1320, erroneousVectorIndex=erroneousVectorIndex@entry=0xbfa0d35c) at cpa_api.c:3134
> > >>
> > >> (gdb) p pNode
> > >> $2 = (NCS_PATRICIA_NODE *) 0x5e
> > >> (gdb) p *pTree
> > >> $4 = {root_node = {bit = -1, left = 0x8f7e9c0, right = 0x8f733e4, key_info = 0x8f734b8 ""}, params = {key_size = 8, info_size = 0, actual_key_size = 0, node_size = 0}, n_nodes = 3}
> > >>
> > >> sometimes application exits with below message:
> > >>
> > >> Feb 19 15:13:31 controller2 RIB[28395]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err:0 len:0
> > >> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart probation timer started (timeout: 4000000000 ns)
> > >> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO Restarting a component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
> > >>
> > >> Below is the modified code snippet from file
> > >> osaf/libs/core/mds/mds_dt_trans.c
> > >>
> > >>     } else if (2 == recd_bytes) {
> > >>         uint16_t local_len_buf = 0;
> > >>
> > >>         data = tcp_cb->len_buff;
> > >>         local_len_buf = ncs_decode_16bit(&data);
> > >>         tcp_cb->buff_total_len = local_len_buf;
> > >>         tcp_cb->num_by_read_for_len_buff = 2;
> > >>
> > >>         if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
> > >>             /* Length + 2 is done to reuse the same buffer
> > >>                while sending to other nodes */
> > >>             syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
> > >>             return;
> > >>         }
> > >>         recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
> > >>         if (recd_bytes < 0) {
> > >>             return;
> > >>         } else if (0 == recd_bytes) {
> > >>             syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err:%d len:%d", recd_bytes, errno, local_len_buf);
> > >>             close(tcp_cb->DBSRsock);
> > >>             exit(0); *<<<<<<<EXITS HERE>>>>>>>>>>*
> > >>         } else if (local_len_buf > recd_bytes) {
> > >>             /* can happen only in two cases, system call interrupt or half data, */
> > >>             TRACE("less data recd, recd bytes = %d, actual len = %d", recd_bytes, local_len_buf);
> > >>             tcp_cb->bytes_tb_read = tcp_cb->buff_total_len - recd_bytes;
> > >>             return;
> > >>
> > >> local_len_buf turns out be 0, this causes recv() to return 0 and
> > >> application exits. Is this programming bug??
> > >>
> > >> Could someone please help to resolve these issues.
> > >>
> > >> Regards,
> > >> Girish
> > >
> > > ------------------------------------------------------------------------------
> > > Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> > > from Actuate! Instantly Supercharge Your Business Reports and
> > > Dashboards with Interactivity, Sharing, Native Excel Exports, App
> > > Integration & more
> > > Get technology previously reserved for billion-dollar corporations, FREE
> > > http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
> > > _______________________________________________
> > > Opensaf-users mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/opensaf-users
