- **status**: assigned --> not-reproducible
- **Comment**:

Hi Girish,
 
The issue is NOT reproducible on SUSE 11. I tested as follows:
 
1) Increased shared memory
   
    To increase the shared memory size, I modified the /dev/shm line in
    /etc/fstab to look something like this:
   
    
------------------------------------------------------------------------------------------------------------------
      #vi /etc/fstab
         tmpfs      /dev/shm        tmpfs   defaults,size=1024g        0 0
  
     Remount tmpfs and verify:
         # mount -o remount tmpfs
 
        # df  /dev/shm/
        Filesystem      1K-blocks  Used  Available Use% Mounted on
       tmpfs          1073741824   972 1073740852   1% /dev/shm
    
    
------------------------------------------------------------------------------------------------------------------
    
2) Increased system buffers
   
     To increase tcp_rmem, tcp_wmem, wmem_max and rmem_max, I modified
     /etc/init.d/opensafd by adding the following to the start() function,
     so that it looks something like this:
 
    
-----------------------------------------------------------------------------------------------------------------
       start() {
             
           sysctl -w net.core.wmem_max=13107100
           sysctl -w net.core.rmem_max=13107100
           sysctl -w net.ipv4.tcp_rmem="4096 87380 13107100"
           sysctl -w net.ipv4.tcp_wmem="4096 87380 13107100"
    
------------------------------------------------------------------------------------------------------------------
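One way to see what raising rmem_max changes is to ask the kernel for a large receive buffer and read back what it granted. The sketch below is only an illustration: the helper name effective_rcvbuf is hypothetical, and it assumes the buffer is requested through SO_RCVBUF on a TCP socket, which this ticket does not confirm for MDS. On Linux the kernel doubles the requested value and caps the request at net.core.rmem_max.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: requests a receive buffer of `requested` bytes on a
 * throwaway TCP socket and returns the size the kernel reports back,
 * or -1 on error. */
static int effective_rcvbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int granted = -1;
    socklen_t len = sizeof(granted);

    /* Set the requested size, then read back the size actually in effect. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        granted = -1;

    close(fd);
    return granted;
}
```

Calling effective_rcvbuf(13107100) before and after the sysctl changes should show whether the larger limit is actually in effect on that node.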
    

3) Corrected your test application cpsv_test_app.c
    
       Modify cpsv_test_app.c as follows and rebuild ckpt_demo:

    
------------------------------------------------------------------------------------------------------------------
       ckptCreateAttr.checkpointSize = (400 << 20);
       ckptCreateAttr.retentionDuration = 100000;
       /* was: ckptCreateAttr.maxSections = 2;
        * A checkpointSize of (400 << 20) can hold only one section
        * of (400 << 20). */
       ckptCreateAttr.maxSections = 1;
       ckptCreateAttr.maxSectionSize = (400 << 20);
       ckptCreateAttr.maxSectionIdSize = 4;
    
    
       #gcc cpsv_main_app.c  cpsv_test_app.c -o ckpt_demo -lSaCkpt
    
------------------------------------------------------------------------------------------------------------------
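The reasoning behind the maxSections change is plain arithmetic: maxSections sections of maxSectionSize bytes have to fit into checkpointSize. A minimal sketch of that check follows. The struct and the function name are hypothetical stand-ins for illustration, not the real SaCkptCheckpointCreationAttributesT from saCkpt.h, and this is not presented as a check that OpenSAF itself performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the size-related fields set in cpsv_test_app.c. */
struct ckpt_size_attrs {
    uint64_t checkpoint_size;   /* total bytes available for all sections */
    uint32_t max_sections;      /* maximum number of sections */
    uint64_t max_section_size;  /* maximum bytes per section */
};

/* Returns true when max_sections sections of max_section_size bytes fit
 * into checkpoint_size. Dividing instead of multiplying avoids overflow. */
static bool ckpt_sizes_consistent(const struct ckpt_size_attrs *a)
{
    assert(a != NULL);
    if (a->max_sections == 0 || a->max_section_size == 0)
        return true;  /* nothing to fit */
    return a->max_section_size <= a->checkpoint_size / a->max_sections;
}
```

With the original values (two sections of 400 << 20 bytes in a checkpoint of 400 << 20 bytes) the function returns false; with maxSections set to 1 it returns true.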

4) Test Result
================================================================================

SC-1:/avm/opensaf_app/cpsv_applications/girish # ./ckpt_demo 1
*******************************************************************
Demonstrating Checkpoint Service Usage with a collocated Checkpoint
*******************************************************************
Initialising With Checkpoint Service....
 
CPSV:CPA:ONPASSED
Opening Collocated Checkpoint = safCkpt=DemoCkpt,safApp=safCkptService with 
create flags.... with size : 419430400
PASSED
Setting the Active Replica for my checkpoint ....       PASSED
....................................
CheckpointData being written = ""
DataOffset = 396680001 ....
Failed rc=12
Writing to Checkpoint safCkpt=DemoCkpt,safApp=safCkptService ....
Section-Id = 11 ....
CheckpointData being written = ""
DataOffset = 405120001 ....
Failed rc=12
Press <Enter> key to continue...
 
Synchronizing My Checkpoint being called ....
PASSED
Unlink My Checkpoint ....       PASSED
Ckpt Closed ....        PASSED
Ckpt Finalize being called .... PASSED
SC-1:/avm/opensaf_app/cpsv_applications/girish #


SC-2:/avm/opensaf_app/cpsv_applications/girish # ./ckpt_demo 0
*******************************************************************
Demonstrating Checkpoint Service Usage with a collocated Checkpoint
*******************************************************************
Initialising With Checkpoint Service....
 
CPSV:CPA:ONPASSED
Opening Collocated Checkpoint = safCkpt=DemoCkpt,safApp=safCkptService with 
create flags.... with size : 419430400
PASSED
Waiting to Read from Checkpoint safCkpt=DemoCkpt,safApp=safCkptService....
Press <Enter> key to continue...
 
Checkpoint Data Read = ""
Failed rc=12
Synchronizing My Checkpoint being called ....
Failed rc=12
Ckpt Closed ....        PASSED
Ckpt Finalize being called .... PASSED
================================================================================
 
- AVM  




---

** [tickets:#1436] MDS (TCP transport) fragment gets dropped, not received on 
standby node**

**Status:** not-reproducible
**Milestone:** 4.6.1
**Created:** Thu Aug 06, 2015 06:47 AM UTC by Girish
**Last Updated:** Wed Sep 23, 2015 04:46 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- 
[cpsv_test_app.c](https://sourceforge.net/p/opensaf/tickets/1436/attachment/cpsv_test_app.c)
 (8.5 kB; text/x-csrc)


Opensaf version: 4.6
Linux: Standard Fedora 22 release, no additional patches required
default wmem_max/rmem_max values 
default buffer sizes for MDS_SOCK_SND_RCV_BUF_SIZE and DTM_SOCK_SND_RCV_BUF_SIZE
Active-standby model
opensaf run as root user/group

Steps:
 1. start opensaf on node1 (active) and node2 (standby)
 2. start ckpt_demo (modified application attached) on active node, ./ckpt_demo 1
 3. wait till all the data is checkpointed
 4. start ckpt_demo on standby node, ./ckpt_demo 0
 
 Notice Error messages in mds.log:
 
 MDTM: Some stale message recd, hence dropping adest=
 
 My investigation shows that one of the fragments is lost: the active node 
sends it, whereas the standby node does not receive it.
 
 mds log on standby:
 
 May 29  4:30:03.089974 <8461> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:30:03.089995 <8461> ERR    |before mds_mdtm_process_recvdata fun-call 
11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:30:03.090014 <8461> ERR    |MDTM: Recd message with Fragment 
Seqnum=18, frag_num=3049, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35817
May 29  4:30:03.090032 <8461> ERR    |MDTM: Reassembling in FULL UB
May 29  4:30:03.090174 <8461> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:30:03.090198 <8461> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:30:03.090216 <8461> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:30:03.090238 <8461> ERR    |before mds_mdtm_process_recvdata fun-call 
11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:30:03.090257 <8461> ERR    |MDTM: Recd message with Fragment 
Seqnum=18, frag_num=3050, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35818
May 29  4:30:03.090275 <8461> ERR    |MDTM: Reassembling in FULL UB
May 29  4:30:03.090735 <8461> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:30:03.090762 <8461> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:30:03.090780 <8461> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:30:03.090801 <8461> ERR    |before mds_mdtm_process_recvdata fun-call 
11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:30:03.090820 <8461> ERR    |MDTM: Recd message with Fragment 
Seqnum=18, frag_num=3051, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35819
May 29  4:30:03.090838 <8461> ERR    |MDTM: Reassembling in FULL UB
May 29  4:30:03.090978 <8461> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:30:03.091028 <8461> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:30:03.091047 <8461> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:30:03.091068 <8461> ERR    |before mds_mdtm_process_recvdata fun-call 
11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:30:03.091087 <8461> ERR    |MDTM: Recd message with Fragment 
Seqnum=18, frag_num=3053, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35821
May 29  4:30:03.091106 <8461> ERR    |MDTM: ERROR Frag recd is not next frag so 
dropping adest=<0x0002020f000064e2>
May 29  4:30:03.091125 <8461> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:30:03.091143 <8461> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:30:03.091160 <8461> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:30:03.091180 <8461> ERR    |before mds_mdtm_process_recvdata fun-call 
11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:30:03.091198 <8461> ERR    |MDTM: Recd message with Fragment 
Seqnum=18, frag_num=3054, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35822
May 29  4:30:03.091216 <8461> ERR    |MDTM: Message is dropped as msg is out of 
seq TRANSPOR-ID=<0x0002020f000064e2> 
May 29  4:30:03.091235 <8461> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:30:03.091283 <8461> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:30:03.091302 <8461> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp


mds log on active:

May 29  4:29:36.021518 <25826> ERR    |before mds_mdtm_process_recvdata 
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:29:36.021537 <25826> ERR    |MDTM: Recd message with Fragment 
Seqnum=5, frag_num=3049, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35817
May 29  4:29:36.021554 <25826> ERR    |MDTM: Reassembling in flat UB
May 29  4:29:36.021702 <25995> ERR    |successfully sent message, send_len=1456
May 29  4:29:36.021729 <25995> ERR    |MDTM:2 Sending message with Service 
Seqno=4, Fragment Seqnum=5, frag_num=35818, TO Dest_Tipc_id=<0x0002020f:25826>
May 29  4:29:36.021778 <25826> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:29:36.021800 <25826> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:29:36.021817 <25826> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:29:36.021837 <25826> ERR    |before mds_mdtm_process_recvdata 
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:29:36.021860 <25826> ERR    |MDTM: Recd message with Fragment 
Seqnum=5, frag_num=3050, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35818
May 29  4:29:36.021878 <25826> ERR    |MDTM: Reassembling in flat UB
May 29  4:29:36.022024 <25995> ERR    |successfully sent message, send_len=1456
May 29  4:29:36.022050 <25995> ERR    |MDTM:2 Sending message with Service 
Seqno=4, Fragment Seqnum=5, frag_num=35819, TO Dest_Tipc_id=<0x0002020f:25826>
May 29  4:29:36.022088 <25826> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:29:36.022109 <25826> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:29:36.022126 <25826> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:29:36.022149 <25826> ERR    |before mds_mdtm_process_recvdata 
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:29:36.022168 <25826> ERR    |MDTM: Recd message with Fragment 
Seqnum=5, frag_num=3051, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35819
May 29  4:29:36.022185 <25826> ERR    |MDTM: Reassembling in flat UB
May 29  4:29:36.022330 <25995> ERR    |successfully sent message, send_len=1456
May 29  4:29:36.022357 <25995> ERR    |MDTM:2 Sending message with Service 
Seqno=4, Fragment Seqnum=5, frag_num=35820, TO Dest_Tipc_id=<0x0002020f:25826>
May 29  4:29:36.022393 <25826> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:29:36.022415 <25826> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:29:36.022431 <25826> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:29:36.022451 <25826> ERR    |before mds_mdtm_process_recvdata 
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:29:36.022470 <25826> ERR    |MDTM: Recd message with Fragment 
Seqnum=5, frag_num=3052, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35820
May 29  4:29:36.022487 <25826> ERR    |MDTM: Reassembling in flat UB
May 29  4:29:36.022635 <25995> ERR    |successfully sent message, send_len=1456
May 29  4:29:36.022662 <25995> ERR    |MDTM:2 Sending message with Service 
Seqno=4, Fragment Seqnum=5, frag_num=35821, TO Dest_Tipc_id=<0x0002020f:25826>
May 29  4:29:36.022698 <25826> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:29:36.022719 <25826> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:29:36.022736 <25826> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:29:36.022756 <25826> ERR    |before mds_mdtm_process_recvdata 
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:29:36.022790 <25826> ERR    |MDTM: Recd message with Fragment 
Seqnum=5, frag_num=3053, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35821
May 29  4:29:36.022807 <25826> ERR    |MDTM: Reassembling in flat UB
May 29  4:29:36.022955 <25995> ERR    |successfully sent message, send_len=1456
May 29  4:29:36.022982 <25995> ERR    |MDTM:2 Sending message with Service 
Seqno=4, Fragment Seqnum=5, frag_num=35822, TO Dest_Tipc_id=<0x0002020f:25826>
May 29  4:29:36.023019 <25826> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:29:36.023040 <25826> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1
May 29  4:29:36.023057 <25826> ERR    |>>>>>>>>>>>> 
mdtm_process_poll_recv_data_tcp
May 29  4:29:36.023077 <25826> ERR    |before mds_mdtm_process_recvdata 
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29  4:29:36.023096 <25826> ERR    |MDTM: Recd message with Fragment 
Seqnum=5, frag_num=3054, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35822
May 29  4:29:36.023113 <25826> ERR    |MDTM: Reassembling in flat UB
May 29  4:29:36.023258 <25995> ERR    |successfully sent message, send_len=1456
May 29  4:29:36.023285 <25995> ERR    |MDTM:2 Sending message with Service 
Seqno=4, Fragment Seqnum=5, frag_num=35823, TO Dest_Tipc_id=<0x0002020f:25826>
May 29  4:29:36.023322 <25826> ERR    |mdtm_process_recv_events_tcp: pollres=1
May 29  4:29:36.023342 <25826> ERR    |mdtm_process_recv_events_tcp: 
pfd[0].revents=1


Notice that pkt_type=35820 (frag_num=3052) is not received on the standby node.
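The gap can be found mechanically by walking the frag_num values in arrival order and reporting the first one that does not follow its predecessor. The function below is a minimal sketch of that idea; the name first_missing_frag is hypothetical, and it is not the MDS reassembly code. It uses 0 as the "no gap" result, which assumes 0 is never the expected next fragment number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scans fragment numbers in arrival order and returns the first number that
 * was skipped, or 0 if every fragment follows its predecessor by exactly 1. */
static uint32_t first_missing_frag(const uint32_t *frags, size_t n)
{
    assert(frags != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (frags[i] != frags[i - 1] + 1)
            return frags[i - 1] + 1;  /* the fragment that should have arrived */
    }
    return 0;
}
```

Given the frag_num values from the standby log (3049, 3050, 3051, 3053, 3054), it returns 3052; given the values received in the active log (3049 through 3054), it returns 0.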

In our application we need to start the standby node, and it is expected to 
receive all the checkpoint data from the active node.

