Hi Girish,
Please also try to reproduce the problem on your Fedora 22 by increasing the
system buffers,
so that we can ignore system buffers factor in debugging
For example :
sysctl -w net.core.wmem_max=13107100
sysctl -w net.core.rmem_max=13107100
sysctl -w net.ipv4.tcp_rmem="4096 87380 13107100"
sysctl -w net.ipv4.tcp_wmem="4096 87380 13107100"
---
** [tickets:#1436] MDS (TCP transport) fragment gets dropped, not received on
standby node**
**Status:** assigned
**Milestone:** 4.6.1
**Created:** Thu Aug 06, 2015 06:47 AM UTC by Girish
**Last Updated:** Wed Sep 23, 2015 04:41 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**
-
[cpsv_test_app.c](https://sourceforge.net/p/opensaf/tickets/1436/attachment/cpsv_test_app.c)
(8.5 kB; text/x-csrc)
Opensaf version: 4.6
Linux: Standard Fedora 22 release, no additional patches required
default wmem_max/rmem_max values
default buffer sizes for MDS_SOCK_SND_RCV_BUF_SIZE and DTM_SOCK_SND_RCV_BUF_SIZE
Active-standby model
opensaf run as root user/group
Steps:
1. start opensaf on node1 (active) and node2 (standby)
2. start ckpt_demo (modified application attached) on active node, ./ckpt_demo
1
3. wait till all the data is checkpointed
4. start ckpt_demo on standby node, ./ckpt_demo 0
Notice Error messages in mds.log:
MDTM: Some stale message recd, hence dropping adest=
My investigation is that one of the fragment is lost, active node sends -
where as standby by node does not receive.
mds log on standby:
May 29 4:30:03.089974 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.089995 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.090014 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3049, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35817
May 29 4:30:03.090032 <8461> ERR |MDTM: Reassembling in FULL UB
May 29 4:30:03.090174 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.090198 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.090216 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.090238 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.090257 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3050, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35818
May 29 4:30:03.090275 <8461> ERR |MDTM: Reassembling in FULL UB
May 29 4:30:03.090735 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.090762 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.090780 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.090801 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.090820 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3051, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35819
May 29 4:30:03.090838 <8461> ERR |MDTM: Reassembling in FULL UB
May 29 4:30:03.090978 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.091028 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.091047 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.091068 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.091087 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3053, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35821
May 29 4:30:03.091106 <8461> ERR |MDTM: ERROR Frag recd is not next frag so
dropping adest=<0x0002020f000064e2>
May 29 4:30:03.091125 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.091143 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.091160 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:30:03.091180 <8461> ERR |before mds_mdtm_process_recvdata fun-call
11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:30:03.091198 <8461> ERR |MDTM: Recd message with Fragment
Seqnum=18, frag_num=3054, from src_Tipc_id=<0x0002020f:25826>, pkt_type=35822
May 29 4:30:03.091216 <8461> ERR |MDTM: Message is dropped as msg is out of
seq TRANSPOR-ID=<0x0002020f000064e2>
May 29 4:30:03.091235 <8461> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:30:03.091283 <8461> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:30:03.091302 <8461> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
mds log on active:
May 29 4:29:36.021518 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.021537 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3049, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35817
May 29 4:29:36.021554 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.021702 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.021729 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35818, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.021778 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.021800 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.021817 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.021837 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.021860 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3050, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35818
May 29 4:29:36.021878 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022024 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022050 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35819, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.022088 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.022109 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.022126 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.022149 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.022168 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3051, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35819
May 29 4:29:36.022185 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022330 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022357 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35820, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.022393 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.022415 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.022431 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.022451 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.022470 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3052, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35820
May 29 4:29:36.022487 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022635 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022662 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35821, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.022698 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.022719 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.022736 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.022756 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.022790 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3053, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35821
May 29 4:29:36.022807 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.022955 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.022982 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35822, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.023019 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.023040 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
May 29 4:29:36.023057 <25826> ERR |>>>>>>>>>>>>
mdtm_process_poll_recv_data_tcp
May 29 4:29:36.023077 <25826> ERR |before mds_mdtm_process_recvdata
fun-call 11111, recd_bytes=1454, buff_toal_len=1454
May 29 4:29:36.023096 <25826> ERR |MDTM: Recd message with Fragment
Seqnum=5, frag_num=3054, from src_Tipc_id=<0x0002020f:25995>, pkt_type=35822
May 29 4:29:36.023113 <25826> ERR |MDTM: Reassembling in flat UB
May 29 4:29:36.023258 <25995> ERR |successfully sent message, send_len=1456
May 29 4:29:36.023285 <25995> ERR |MDTM:2 Sending message with Service
Seqno=4, Fragment Seqnum=5, frag_num=35823, TO Dest_Tipc_id=<0x0002020f:25826>
May 29 4:29:36.023322 <25826> ERR |mdtm_process_recv_events_tcp: pollres=1
May 29 4:29:36.023342 <25826> ERR |mdtm_process_recv_events_tcp:
pfd[0].revents=1
Notice that pkt_type=35820 is not received on standby node.
In our application we require to start standby node and it is expected that it
receives all the checkpoint data from active
---
Sent from sourceforge.net because [email protected] is
subscribed to https://sourceforge.net/p/opensaf/tickets/
To unsubscribe from further messages, a project admin can change settings at
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a
mailing list, you can unsubscribe from the mailing list.------------------------------------------------------------------------------
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets