A user on Frost found that, despite the recent fixes, he still gets a problem resembling the high-bandwidth-usage-way-over-the-limit-no-payload bug. He has the problem only when a specific peer is connected. He sent me logs, and having analysed them it appears that:

- While the global bandwidth limit is the dominant factor, the AIMD maximum window increases indefinitely.
- Once the external bandwidth pressure (from other peers) abates, we send packets at a much faster rate than the link can handle.
- If the link then goes down, has gaps (e.g. because it's over wifi), or suffers severe latency, we may have major problems: we continue to send packets far too quickly, and when they are not acknowledged we will slow down the addition of new packets to send, but by then a lot of packets may already be queued, and we will have to retransmit them.
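To make the failure mode concrete, here is a minimal sketch (invented names, not our actual code) of why a plain AIMD increase diverges when an external limiter keeps the real sending rate well below the window:

    // Minimal AIMD sketch: the window grows on every ack, even though an
    // external bandwidth limiter never lets anywhere near 'window' packets
    // be in flight at once.
    class AimdWindowSketch {
        double window = 2.0;          // congestion window, in packets
        final double decrease = 0.5;  // multiplicative decrease on loss

        void onAck() {
            // Standard additive increase (~1 packet per window's worth of
            // acks). Under a global bandwidth limit we ack a slow trickle
            // for hours, so 'window' keeps climbing regardless of what the
            // link can actually carry.
            window += 1.0 / window;
        }

        void onLoss() {
            window = Math.max(2.0, window * decrease);
        }
    }

Because nothing ties the increase to actual utilisation, by the time the limiter finally lets us send at full speed the window is far larger than the link can handle.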
The long-term cause is that our link layer code is not all that TCP-like and needs to be rewritten: it does congestion avoidance at a higher level than it needs to, and for example the congestion window is expressed as a flow rate rather than as an actual window of packets in flight.

The interaction between the bandwidth limiter and the congestion avoidance algorithm is however interesting in itself. For TCP, bandwidth limiting can be done externally (by dropping packets that go over the limit) or internally (by limiting the flow of data before it reaches TCP). It is not clear that TCP (RFC 2581) adequately handles the latter case. There is a note in section 4.1 to the effect that if a connection is idle for a long period and then restarted you can get a burst (so reset to slow start if the connection has been idle for a while), but it appears to me that the same is possible if you have a trickle of data for a long period and then a burst: TCP will send it at something approaching the maximum possible speed, and get into trouble. IMHO the solution for us is to not increase the congestion window unless we have actually had a full window in flight recently. I'm not sure how this would fit into the NewTransportLayer...

The short-term solution appears to be the following (a rough sketch in Java is at the end of this mail):

- Have an explicit window for each PacketThrottle: be able to count the number of packets in flight to a specific peer.
- Enforce the window, rather than enforcing the rate.
- Don't increase the window size (in congestion control) unless we have actually used the full window - had the full window in flight - within the last round trip time.

The new transport layer would have an explicit window, would do bandwidth limiting at a much lower level (including retransmits), would have a much larger maximum retransmission window, and would generally be much closer to TCP and work better on high-latency connections - but without the last provision above, it could still run into this kind of problem.

New transport layer:
http://wiki.freenetproject.org/NewTransportLayer
http://wiki.freenetproject.org/NewPacketFormat

Thoughts? The short-term fix should be reasonably easy. The new transport layer otoh could take some time.
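For the short-term fix above, something along these lines (hypothetical names, not the real PacketThrottle interface) is what I have in mind: count packets in flight explicitly, block new sends when the window is full, and only grow the window if it has actually been full within the last RTT.

    // Rough sketch of a per-peer throttle with an explicit window.
    class WindowedThrottleSketch {
        private double window = 2.0;         // allowed packets in flight
        private int inFlight = 0;            // sent but not yet acked
        private long lastWindowFullTime = 0; // last time the window was full
        private long rtt = 1000;             // smoothed RTT estimate, in ms

        synchronized boolean canSend() {
            // Enforce the window itself, not a derived flow rate.
            return inFlight < window;
        }

        synchronized void onSend() {
            inFlight++;
            if (inFlight >= window)
                lastWindowFullTime = System.currentTimeMillis();
        }

        synchronized void onAck() {
            inFlight--;
            // Only do additive increase if we have actually had a full
            // window in flight within the last round trip time; otherwise
            // a long rate-limited trickle would inflate the window forever.
            if (System.currentTimeMillis() - lastWindowFullTime <= rtt)
                window += 1.0 / window;
        }

        synchronized void onLossOrTimeout() {
            inFlight = Math.max(0, inFlight - 1);
            window = Math.max(2.0, window / 2);
        }
    }

The lastWindowFullTime check is exactly the third bullet: if we never fill the window, we never get to grow it.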
