Hi,

To isolate the problem further, please test your application over the TIPC
transport with the fix for ticket #1227 applied (on both 4.3 and 4.5), and
report your observations.

If the issue is NOT reproducible over the TIPC transport, then as a workaround
prevent your ckpt application from sending ZERO-size (hack) messages over the
TCP transport, and raise a ticket with all the details, as Mathi explained.
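
For the workaround, here is a minimal sketch of an application-side check. It assumes the zero-size message comes from a checkpoint write whose I/O vector contains an element with no data, which this thread has not confirmed. `struct io_elem` and `first_empty_elem` are hypothetical names for illustration, not OpenSAF APIs; a real application would inspect its own SaCkptIOVectorElementT array instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an I/O vector element (assumption: a real application
 * would use SaCkptIOVectorElementT from saCkpt.h instead). */
struct io_elem {
    const void *data_buffer;
    uint64_t data_size;
};

/* Return the index of the first element that carries no data,
 * or -1 if every element has a buffer and a non-zero size. */
long first_empty_elem(const struct io_elem *vec, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (vec[i].data_buffer == NULL || vec[i].data_size == 0)
            return (long)i;
    }
    return -1;
}
```

The application could call this before each checkpoint write and skip (or log) the write when the result is not -1.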

-AVM


On 2/20/2015 3:33 PM, Girish Nagaraj wrote:
>
> Hi,
>
> Yes, similar issue in TCP also: exits with message:
>
> Feb 20 15:24:59 fedvm1 RIB[28549]: MDTM:socket_recv() = 0, conn lost 
> with dh server, exiting library err :Success
>
> Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 
> 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart 
> probation timer started (timeout: 4000000000 ns)
>
> Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO Restarting a component of 
> 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
>
> Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 
> 'safComp=ribd,safSu=SU1,safSg=zebos-simplex,safApp=zebos' faulted due 
> to 'avaDown' : Recovery is 'componentRestart'
>
> I experimented with code changes:
>
>     recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->len_buff, 2, MSG_NOSIGNAL);
>     if (0 == recd_bytes) {
>         syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 111:%d", recd_bytes, errno);
>         close(tcp_cb->DBSRsock);
>         exit(0);
>     } else if (2 == recd_bytes) {
>         uint16_t local_len_buf = 0;
>
>         data = tcp_cb->len_buff;
>         local_len_buf = ncs_decode_16bit(&data);
>
>         /* MY CHANGE START */
>         if (0 == local_len_buf)
>             return;
>         /* MY CHANGE END */
>
>         tcp_cb->buff_total_len = local_len_buf;
>         tcp_cb->num_by_read_for_len_buff = 2;
>
>         if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
>             /* Length + 2 is done to reuse the same buffer
>                while sending to other nodes */
>             syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
>             return;
>         }
>
>         recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
>         if (recd_bytes < 0) {
>             return;
>         } else if (0 == recd_bytes) {
>             syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 222:%d len:%d", recd_bytes, errno, local_len_buf);
>             close(tcp_cb->DBSRsock);
>             exit(0);
>
>  This caused many other issues, so I think just returning won’t work.
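
On telling an empty frame apart from a closed connection: below is a rough sketch of a receiver that never asks recv() for zero bytes, so a return value of 0 can only mean the peer closed. It assumes a blocking stream socket and a frame made of a 2-byte big-endian length followed by the payload. `read_frame`, `recv_exact` and `enum frame_status` are hypothetical names, not MDS code, and the sketch does not model the partial-read state that mds_dt_trans.c keeps in tcp_cb.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

enum frame_status { FRAME_OK, FRAME_EMPTY, FRAME_CLOSED, FRAME_ERROR };

/* Read exactly len (> 0) bytes: 1 on success, 0 if the peer closed,
 * -1 on error. */
static int recv_exact(int fd, unsigned char *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);

        if (n == 0)
            return 0;   /* we asked for > 0 bytes, so 0 means EOF */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        got += (size_t)n;
    }
    return 1;
}

/* Read one frame. On FRAME_OK, *out is malloc'ed (caller frees) and
 * *out_len holds its size. A zero length header yields FRAME_EMPTY. */
enum frame_status read_frame(int fd, unsigned char **out, uint16_t *out_len)
{
    unsigned char hdr[2];
    int rc = recv_exact(fd, hdr, sizeof(hdr));

    *out = NULL;
    *out_len = 0;
    if (rc <= 0)
        return rc == 0 ? FRAME_CLOSED : FRAME_ERROR;

    *out_len = (uint16_t)((hdr[0] << 8) | hdr[1]);
    if (*out_len == 0)
        return FRAME_EMPTY;   /* no payload: do not call recv() at all */

    *out = malloc(*out_len);
    if (*out == NULL)
        return FRAME_ERROR;

    rc = recv_exact(fd, *out, *out_len);
    if (rc <= 0) {
        free(*out);
        *out = NULL;
        return rc == 0 ? FRAME_CLOSED : FRAME_ERROR;
    }
    return FRAME_OK;
}
```

The caller decides what FRAME_EMPTY means (skip it, log it, or drop the peer) instead of having it collapse into the same path as a lost connection.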
>
> Regards,
>
> Girish
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected] 
> <mailto:[email protected]>]
> Sent: Friday, February 20, 2015 1:38 PM
> To: Girish Nagaraj; [email protected] 
> <mailto:[email protected]>
> Subject: Re: [users] Issues with CPSv
>
> Hi,
>
> On 2/20/2015 1:19 PM, Girish Nagaraj wrote:
>
> > Hi ,
>
> >
>
> >   I think this is not a connection loss: we are passing 0 (the number of
> > bytes to be read) to recv(), which then returns 0 received bytes.
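
On the zero-length request: the Linux recv(2) man page notes that 0 may also be returned when the requested number of bytes on a stream socket is 0, so the return value alone does not prove that the peer closed. A small probe, as a sketch only: `recv_zero_bytes` is a hypothetical helper, and MSG_DONTWAIT is added here just so the probe cannot block (the MDS code quoted in this thread passes 0 as flags).

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Ask recv() for zero bytes without blocking and return its result. */
ssize_t recv_zero_bytes(int fd)
{
    char unused;

    return recv(fd, &unused, 0, MSG_DONTWAIT);
}
```

With data already queued on a connected stream socket, I would expect this to return 0 even though the connection is still up and the queued data is left unread.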
>
> You mean you are seeing an issue similar to TIPC ticket #1227, `mds/tipc :
> protect mds application form zero bytes hacking messages`, for TCP as
> well?
>
> -AVM
>
> >
>
> >       local_len_buf = ncs_decode_16bit(&data);
>
> >
>
> >   Is there mistake in decoding local_len_buf?
>
> >
>
> > Regards,
>
> > Girish
>
> >
>
> > -----Original Message-----
>
> > From: A V Mahesh [mailto:[email protected]]
>
> > Sent: Friday, February 20, 2015 11:03 AM
>
> > To: [email protected] 
> <mailto:[email protected]>
>
> > Subject: Re: [users] Issues with CPSv
>
> >
>
> > Hi,
>
> >
>
> > On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
>
> >> local_len_buf turns out to be 0; this causes recv() to return 0 and the
> >> application exits. Is this a programming bug?
>
> > This is expected behavior: if the connection is lost, a read on the TCP
> > socket returns ZERO bytes. This is not related to CPSv.
>
> >
>
> > -AVM
>
> >
>
> >
>
> > On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
>
> >> Hi,
>
> >>
>
> >>
>
> >>
>
> >> *Background*:
>
> >>
>
> >> Opensaf version: 4.5
>
> >>
>
> >> Number of checkpoints used: 2
>
> >>
>
> >> In our application we use CPSv to save application data. When the
> >> application faults, it is restarted and its state is restored by
> >> reading data from the checkpoints.
>
> >>
>
> >> Model: Simplex
>
> >>
>
> >>
>
> >>
>
> >> * Issue faced:*
>
> >>
>
> >>     application sometimes crashes, stack trace as below:
>
> >>
>
> >>
>
> >>
>
> >> Program received signal SIGSEGV, Segmentation fault.
>
> >>
>
> >> search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8
>
> >> "H\356\367\b") at patricia.c:94
>
> >>
>
> >> 94      patricia.c: No such file or directory.
>
> >>
>
> >> (gdb) bt
>
> >>
>
> >> #0  search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8
>
> >> "H\356\367\b") at patricia.c:94
>
> >>
>
> >> #1  0xb76d0bef in ncs_patricia_tree_get (pTree=pTree@entry=0x8f733e4,
>
> >> pKey=pKey@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:434
>
> >>
>
> >> #2  0xb7738493 in cpa_lcl_ckpt_node_get
>
> >> (lcl_ckpt_tree=lcl_ckpt_tree@entry=0x8f733e4,
>
> >> lc_hdl=lc_hdl@entry=0xbfa0cdf8, lc_node=lc_node@entry=0xbfa0ce10)
>
> >>
>
> >>       at cpa_db.c:195
>
> >>
>
> >> #3  0xb7734d76 in saCkptCheckpointWrite (checkpointHandle=150466120,
>
> >> ioVector=0x92c6d28, numberOfElements=1320,
>
> >>
>
> >> erroneousVectorIndex=erroneousVectorIndex@entry=0xbfa0d35c) at
>
> >> cpa_api.c:3134
>
> >>
>
> >>
>
> >>
>
> >> (gdb) p pNode
>
> >>
>
> >> $2 = (NCS_PATRICIA_NODE *) 0x5e
>
> >>
>
> >> (gdb) p *pTree
>
> >>
>
> >> $4 = {root_node = {bit = -1, left = 0x8f7e9c0, right = 0x8f733e4,
>
> >> key_info = 0x8f734b8 ""}, params = {key_size = 8, info_size = 0,
>
> >> actual_key_size = 0,
>
> >>
>
> >>       node_size = 0}, n_nodes = 3}
>
> >>
>
> >>
>
> >>
>
> >>     sometimes application exits with below message:
>
> >>
>
> >>
>
> >>
>
> >> Feb 19 15:13:31 controller2 RIB[28395]: MDTM:socket_recv() = 0, conn
>
> >> lost with dh server, exiting library err:0 len:0
>
> >>
>
> >> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO
>
> >> 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart
>
> >> probation timer started (timeout: 4000000000 ns)
>
> >>
>
> >> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO Restarting a
>
> >> component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp
>
> >> restart count: 1)
>
> >>
>
> >>
>
> >>
>
> >>
>
> >>
>
> >> Below is the modified code snippet from file
>
> >> osaf/libs/core/mds/mds_dt_trans.c
>
> >>
>
> >>
>
> >>
>
> >> } else if (2 == recd_bytes) {
> >>     uint16_t local_len_buf = 0;
> >>
> >>     data = tcp_cb->len_buff;
> >>     local_len_buf = ncs_decode_16bit(&data);
> >>     tcp_cb->buff_total_len = local_len_buf;
> >>     tcp_cb->num_by_read_for_len_buff = 2;
> >>
> >>     if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
> >>         /* Length + 2 is done to reuse the same buffer
> >>            while sending to other nodes */
> >>         syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
> >>         return;
> >>     }
> >>
> >>     recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
> >>     if (recd_bytes < 0) {
> >>         return;
> >>     } else if (0 == recd_bytes) {
> >>         syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err:%d len:%d", recd_bytes, errno, local_len_buf);
> >>         close(tcp_cb->DBSRsock);
> >>         exit(0);    <<<<<<<EXITS HERE>>>>>>>>>>
> >>     } else if (local_len_buf > recd_bytes) {
> >>         /* can happen only in two cases, system call interrupt or half data, */
> >>         TRACE("less data recd, recd bytes = %d, actual len = %d", recd_bytes, local_len_buf);
> >>         tcp_cb->bytes_tb_read = tcp_cb->buff_total_len - recd_bytes;
> >>         return;
>
> >>
>
> >>
>
> >>
>
> >> local_len_buf turns out to be 0; this causes recv() to return 0 and the
> >> application exits. Is this a programming bug?
>
> >>
>
> >>
>
> >>
>
> >> Could someone please help to resolve these issues.
>
> >>
>
> >>
>
> >>
>
> >> Regards,
>
> >>
>
> >> Girish
>
> >>
>
> >
>
> > ----------------------------------------------------------------------
>
> > -------- Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT
>
> > Server from Actuate! Instantly Supercharge Your Business Reports and
>
> > Dashboards with Interactivity, Sharing, Native Excel Exports, App
>
> > Integration & more Get technology previously reserved for
>
> > billion-dollar corporations, FREE
>
> > http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.
>
> > clktrk _______________________________________________
>
> > Opensaf-users mailing list
>
> > [email protected] 
> <mailto:[email protected]>
>
> > https://lists.sourceforge.net/lists/listinfo/opensaf-users
>
> >
>
>
> . 
