Steve Hill wrote:
> On Wed, 3 Jan 2007, Sridhar Samudrala wrote:
>
> Sorry for the delay in replying.
>
>> No. lksctp-developers mailing list is still the best place for SCTP related
>> discussions. You can subscribe and look in the archives at
>> http://lists.sourceforge.net/lists/listinfo/lksctp-developers
>
> Hmm, I had a look there and it seemed reasonably inactive and overrun by
> spam. (And I've been unable to subscribe.)
>
>> How are the 2 machines connected? Are they connected directly or
>> via a router?
>
> They are currently connected together directly through crossover cables.
>
>> Do you see both the addresses when you do cat /proc/net/sctp/assocs
>> after the association is established on both the peers?
>
> Yes, the contents of /proc/net/sctp/assocs look correct.
>
>> How are you dropping traffic? You could try simulating failover by
>> bringing down the interface or physically removing the link.
>
> I have been using iptables to drop SCTP packets on both the INPUT and
> OUTPUT chains. However, I get the same results if I just unplug the
> network cable (using iptables is easier for my testing since I don't have
> to crawl around behind the test systems :)
>
>>> 1. Sometimes, just after failing over to the second path I see an ABORT.
>> This seems to indicate that somehow the app has terminated.
>
> The abort _appears_ to be caused by a retransmit timer expiring, causing
> the SCTP stack to tear down the association. However, I haven't done much
> investigation of this problem yet - I've been focussing on the second
> problem since it seems to happen more frequently.
>
>>> 2. More frequently, the association stays up indefinitely, with heartbeat
>>> requests and acks on the second path, but no data chunks are sent even
>>> though the transmit queue on the transmitting end appears to be full and
>>> the socket is blocking writes.
>> This is strange. Can you collect tcpdump traces on sender and receiver when
>> this happens?
>
> I've taken dumps of the data on the wire for both paths:
> http://www.nexusuk.org/~steve/sctp/path1.pcap
> http://www.nexusuk.org/~steve/sctp/path2.pcap

Taking a look at these, it does appear to completely stall... There are some
rather interesting retransmissions that don't look quite right...

>
> I can't see anything odd in the network traffic - it just stops as if it
> has no more data to send. However, the socket appears to still be
> blocking so the application cannot give it any new data.
>
> This seems to be a problem with the abandonment functionality:
> 1. Transmit chunk 1. The transmitted list now contains chunk 1.
> 2. Chunk 1 and its retransmissions get lost on the network.
> 3. Abandon chunk 1. The transmitted list is now empty. This causes a
>    FORWARD TSN chunk to be sent to the peer telling him to advance CTSN
>    to that of chunk 1.
> 4. Transmit chunk 2. The transmitted list now contains chunk 2.
> 5. Receive a gap-ack for chunk 2, indicating that chunk 1 is missing.

Yes, but at this point, we will regenerate the FORWARD TSN since chunk 1 is
still on the abandoned list.

> At this point, the T3 timer is disabled at the bottom of
> sctp_check_transmitted() since all the chunks in the transmitted queue are
> gap-acked. The whole connection now stalls, waiting for the SACK for
> chunk 1 that will never arrive.
>

I'll look some more at this...

-vlad

> It should be noted that this is not unordered data and I'm not clear on
> how abandoned chunks are supposed to be handled - I hadn't intentionally
> enabled the abandonment functionality, the timetolive was set on the
> transmitted chunks by accident.
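
For anyone following the five-step sequence quoted above, here is a rough toy model of it in Python. It is only a sketch of the hypothesis as Steve describes it, not the kernel's sctp_check_transmitted(): the ToySender class, its fields and its methods are all made up for illustration, and it deliberately leaves out the FORWARD TSN regeneration mentioned in the reply, so it shows the suspected stall rather than confirmed stack behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ToySender:
    """Hypothetical sender state; none of these names come from the kernel."""
    transmitted: list = field(default_factory=list)  # TSNs sent, not yet cumulatively acked
    abandoned: list = field(default_factory=list)    # TSNs given up on (timetolive expired)
    gap_acked: set = field(default_factory=set)      # TSNs the peer reported in gap-ack blocks
    cum_ack: int = 0                                 # highest TSN cumulatively acked by the peer
    t3_running: bool = False                         # retransmission timer state
    forward_tsn_log: list = field(default_factory=list)  # FORWARD TSN values sent so far

    def transmit(self, tsn):
        # Steps 1 and 4: queue the chunk and arm the retransmission timer.
        self.transmitted.append(tsn)
        self.t3_running = True

    def abandon(self, tsn):
        # Step 3: move the chunk to the abandoned list and send FORWARD TSN.
        self.transmitted.remove(tsn)
        self.abandoned.append(tsn)
        self.forward_tsn_log.append(tsn)

    def receive_sack(self, cum_ack, gap_acks):
        # Step 5: record the SACK. Assumed behaviour under test: the timer is
        # stopped once every chunk still on the transmitted list is gap-acked.
        self.cum_ack = max(self.cum_ack, cum_ack)
        self.gap_acked.update(gap_acks)
        if all(tsn in self.gap_acked for tsn in self.transmitted):
            self.t3_running = False

    def stalled(self):
        # Data above the cumulative ack point remains, but no timer is running.
        pending = any(tsn > self.cum_ack for tsn in self.abandoned + self.transmitted)
        return pending and not self.t3_running


sender = ToySender()
sender.transmit(1)                            # step 1
sender.abandon(1)                             # steps 2-3: chunk 1 lost, then abandoned
sender.transmit(2)                            # step 4
sender.receive_sack(cum_ack=0, gap_acks=[2])  # step 5: gap-ack for chunk 2 only
print(sender.t3_running, sender.stalled())    # prints: False True
```

In this model the stall only disappears if something re-arms a timer or resends the FORWARD TSN while chunk 1 is still on the abandoned list, which is the part the reply suggests the real stack should already be doing.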