- **status**: assigned --> not-reproducible
- **Comment**:
Hi Girish,
The issue is NOT reproducible on SUSE 11. I tested as follows:
1) Increased shared memory
To increase the shared memory size, I modified the tmpfs line in /etc/fstab to
look like this:
------------------------------------------------------------------------------------------------------------------
#vi /etc/fstab
tmpfs /dev/shm tmpfs defaults,size=1024g 0 0
Remount tmpfs and verify:
# mount -o remount tmpfs
# df /dev/shm/
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 1073741824 972 1073740852 1% /dev/shm
------------------------------------------------------------------------------------------------------------------
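The df check above can also be done from a program. The following is only a minimal sketch: the helper names fs_total_bytes and fs_can_hold are hypothetical and are not part of OpenSAF. It uses the POSIX statvfs() call to confirm that a mount point such as /dev/shm is large enough for the checkpoint being created.

```c
#include <sys/statvfs.h>

/* Returns the total size in bytes of the filesystem that contains `path`,
 * or -1 if statvfs() fails (for example, the path does not exist). */
long long fs_total_bytes(const char *path)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return -1;

    /* f_blocks is counted in units of f_frsize (the fragment size). */
    return (long long)st.f_blocks * (long long)st.f_frsize;
}

/* Returns 1 if the filesystem at `path` has a total size of at least
 * `needed` bytes, 0 if it is smaller, and -1 on error. */
int fs_can_hold(const char *path, long long needed)
{
    long long total = fs_total_bytes(path);

    if (total < 0)
        return -1;
    return total >= needed ? 1 : 0;
}
```

For the test in this comment, a call such as fs_can_hold("/dev/shm", 400LL << 20) would be expected to return 1 after the remount. This checks total size only, not free space.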
2) Increased system buffers
To increase tcp_rmem, tcp_wmem, wmem_max and rmem_max, I modified
/etc/init.d/opensafd by adding the following to the start() function, so that it
looks like this:
-----------------------------------------------------------------------------------------------------------------
start() {
sysctl -w net.core.wmem_max=13107100
sysctl -w net.core.rmem_max=13107100
sysctl -w net.ipv4.tcp_rmem="4096 87380 13107100"
sysctl -w net.ipv4.tcp_wmem="4096 87380 13107100"
------------------------------------------------------------------------------------------------------------------
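On Linux, an explicit SO_RCVBUF request is capped by net.core.rmem_max, and the kernel reports back twice the value it actually applied, which is one reason the limits above matter. The sketch below is an illustration only: effective_rcvbuf is a hypothetical helper, and it is an assumption that the OpenSAF buffer settings reach the kernel through setsockopt() in this way. It shows how to see what a socket is really granted.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Asks the kernel for a receive buffer of `requested` bytes on a fresh
 * TCP socket and returns the size the kernel reports back, or -1 on error. */
int effective_rcvbuf(int requested)
{
    int granted = 0;
    socklen_t len = sizeof(granted);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0) {
        close(fd);
        return -1;
    }

    close(fd);
    return granted;
}
```

Calling effective_rcvbuf(13107100) before and after the sysctl changes should show whether the larger limit has taken effect on a given node.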
3) Corrected your test application cpsv_test_app.c
Modify cpsv_test_app.c as follows and rebuild ckpt_demo:
------------------------------------------------------------------------------------------------------------------
ckptCreateAttr.checkpointSize = (400 << 20);
ckptCreateAttr.retentionDuration= 100000;
/* was: ckptCreateAttr.maxSections= 2; */
ckptCreateAttr.maxSections= 1; /* checkpointSize is (400 << 20), so it can
hold only one section of (400 << 20) */
ckptCreateAttr.maxSectionSize = (400 << 20);
ckptCreateAttr.maxSectionIdSize = 4;
#gcc cpsv_main_app.c cpsv_test_app.c -o ckpt_demo -lSaCkpt
------------------------------------------------------------------------------------------------------------------
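The reasoning behind the maxSections change can be written as a small consistency check. This is a sketch under assumptions: struct ckpt_sizes and ckpt_sizes_fit are hypothetical names, not SAF or OpenSAF API. It only mirrors the argument above that maxSections sections of maxSectionSize bytes must fit into checkpointSize, and it does not claim that the checkpoint library enforces this rule itself.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three size-related creation attributes. */
struct ckpt_sizes {
    uint64_t checkpoint_size;   /* corresponds to checkpointSize  */
    uint32_t max_sections;      /* corresponds to maxSections     */
    uint64_t max_section_size;  /* corresponds to maxSectionSize  */
};

/* Returns 1 when max_sections sections of max_section_size bytes each
 * fit into checkpoint_size, otherwise 0. */
int ckpt_sizes_fit(const struct ckpt_sizes *a)
{
    if (a->max_sections == 0 || a->max_section_size == 0)
        return 1; /* nothing to store, so it trivially fits */

    /* Dividing instead of multiplying avoids integer overflow. */
    return a->max_section_size <= a->checkpoint_size / a->max_sections;
}
```

With checkpoint_size and max_section_size both set to 400 << 20, the check passes for one section and fails for two, which matches the correction made to cpsv_test_app.c.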
4) Test Result
================================================================================
SC-1:/avm/opensaf_app/cpsv_applications/girish # ./ckpt_demo 1
*******************************************************************
Demonstrating Checkpoint Service Usage with a collocated Checkpoint
*******************************************************************
Initialising With Checkpoint Service....
CPSV:CPA:ONPASSED
Opening Collocated Checkpoint = safCkpt=DemoCkpt,safApp=safCkptService with
create flags.... with size : 419430400
PASSED
Setting the Active Replica for my checkpoint .... PASSED
....................................
CheckpointData being written = ""
DataOffset = 396680001 ....
Failed rc=12
Writing to Checkpoint safCkpt=DemoCkpt,safApp=safCkptService ....
Section-Id = 11 ....
CheckpointData being written = ""
DataOffset = 405120001 ....
Failed rc=12
Press <Enter> key to continue...
Synchronizing My Checkpoint being called ....
PASSED
Unlink My Checkpoint .... PASSED
Ckpt Closed .... PASSED
Ckpt Finalize being called .... PASSED
SC-1:/avm/opensaf_app/cpsv_applications/girish #
SC-2:/avm/opensaf_app/cpsv_applications/girish # ./ckpt_demo 0
*******************************************************************
Demonstrating Checkpoint Service Usage with a collocated Checkpoint
*******************************************************************
Initialising With Checkpoint Service....
CPSV:CPA:ONPASSED
Opening Collocated Checkpoint = safCkpt=DemoCkpt,safApp=safCkptService with
create flags.... with size : 419430400
PASSED
Waiting to Read from Checkpoint safCkpt=DemoCkpt,safApp=safCkptService....
Press <Enter> key to continue...
Checkpoint Data Read = ""
Failed rc=12
Synchronizing My Checkpoint being called ....
Failed rc=12
Ckpt Closed .... PASSED
Ckpt Finalize being called .... PASSED
================================================================================
- AVM
---
**[tickets:#1436] MDS (TCP transport) fragment gets dropped, not received on standby node**
**Status:** not-reproducible
**Milestone:** 4.6.1
**Created:** Thu Aug 06, 2015 06:47 AM UTC by Girish
**Last Updated:** Wed Sep 23, 2015 04:46 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**
- [cpsv_test_app.c](https://sourceforge.net/p/opensaf/tickets/1436/attachment/cpsv_test_app.c) (8.5 kB; text/x-csrc)
OpenSAF version: 4.6
Linux: Standard Fedora 22 release, no additional patches required
default wmem_max/rmem_max values
default buffer sizes for MDS_SOCK_SND_RCV_BUF_SIZE and DTM_SOCK_SND_RCV_BUF_SIZE
Active-standby model
OpenSAF runs as root user/group
Steps:
1. Start OpenSAF on node1 (active) and node2 (standby).
2. Start ckpt_demo (modified application attached) on the active node: ./ckpt_demo 1
3. Wait until all the data is checkpointed.
4. Start ckpt_demo on the standby node: ./ckpt_demo 0
Notice the error messages in mds.log:
MDTM: Some stale message recd, hence dropping adest=
My investigation indicates that one of the fragments is lost: the active node
sends it, whereas the standby node does not receive it.
mds log on standby:
May 29 4:30:03.089974 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.089995 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.090014 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3049, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35817
May 29 4:30:03.090032 <8461> ERR |MDTM: Reassembling in FULL UB
May 29 4:30:03.090174 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.090198 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.090216 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.090238 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.090257 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3050, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35818
May 29 4:30:03.090275 <8461> ERR |MDTM: Reassembling in FULL UB
May 29 4:30:03.090735 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.090762 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.090780 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.090801 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.090820 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3051, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35819
May 29 4:30:03.090838 <8461> ERR |MDTM: Reassembling in FULL UB
May 29 4:30:03.090978 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.091028 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.091047 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.091068 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.091087 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3053, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35821
May 29 4:30:03.091106 <8461> ERR |MDTM: ERROR Frag recd is not next frag so
dropping adest=<0x0002020f000064e2>
May 29 4:30:03.091125 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.091143 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.091160 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.091180 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.091198 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3054, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35822
May 29 4:30:03.091216 <8461> ERR |MDTM: Message is dropped as msg is out of
seq TRANSPOR-ID=<0x0002020f000064e2>
May 29 4:30:03.091235 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.091283 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.091302 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
mds log on active:
May 29 4:29:36.021518 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.021537 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3049, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35817
May 29 4:29:36.021554 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.021702 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.021729 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35818, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.021778 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.021800 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.021817 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.021837 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.021860 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3050, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35818
May 29 4:29:36.021878 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022024 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022050 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35819, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.022088 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.022109 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.022126 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.022149 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.022168 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3051, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35819
May 29 4:29:36.022185 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022330 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022357 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35820, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.022393 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.022415 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.022431 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.022451 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.022470 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3052, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35820
May 29 4:29:36.022487 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022635 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022662 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35821, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.022698 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.022719 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.022736 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.022756 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.022790 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3053, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35821
May 29 4:29:36.022807 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022955 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022982 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35822, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.023019 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.023040 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.023057 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.023077 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.023096 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3054, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35822
May 29 4:29:36.023113 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.023258 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.023285 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35823, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.023322 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.023342 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
Notice that pkt_type=35820 (frag_num=3052) is not received on the standby node:
its log goes from frag_num=3051 straight to frag_num=3053.
Our application needs to start the standby node and expects it to receive all
the checkpoint data from the active node.
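The gap can be found mechanically from the frag_num values in the logs. The sketch below is illustrative only, and both function names are hypothetical. The 0x7FFF mask is an assumption inferred from the logged pairs (for example pkt_type=35817 alongside frag_num=3049, a difference of 32768); it is not taken from the MDS source.

```c
#include <stddef.h>

/* Assumption: the low 15 bits of the logged pkt_type carry the fragment
 * number, as suggested by log pairs such as 35817 <-> 3049. */
unsigned frag_from_pkt_type(unsigned pkt_type)
{
    return pkt_type & 0x7FFFu;
}

/* Returns the first fragment number missing from an ascending list of
 * received fragment numbers, or 0 if the list has no gap. */
unsigned first_missing_frag(const unsigned *frags, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (frags[i] != frags[i - 1] + 1)
            return frags[i - 1] + 1;
    }
    return 0;
}
```

Feeding in the standby sequence 3049, 3050, 3051, 3053, 3054 yields 3052, which is the fragment logged with pkt_type=35820 on the active node.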
---