On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
>
> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> > Hello,
> >
> > I encountered a strange behaviour of some (non-linux) TCP stack which
> > I believe is incorrect but support engineers from the company producing
> > it claim is OK.
> >
> > Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> > segments but segments 2, 4 and 6 do not reach the server (receiver):
> >
> >          ACK             SAK             SAK             SAK
> >   +-------+-------+-------+-------+-------+-------+-------+
> >   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >   +-------+-------+-------+-------+-------+-------+-------+
> > 34273   35701   37129   38557   39985   41413   42841   44269
> >
> > When segment 2 is retransmitted after RTO timeout, normal response would
> > be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> > 42841-44269).
> >
> > However, this server stack responds with two separate ACKs:
> >
> >   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >   - ACK 38557, SACK 39985-41413 42841-44269
>
> Hmmm... Yes this seems very very wrong and lazy.
>
> Have you verified behavior of more recent linux kernels to such threats ?
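As a sanity check on the expected behaviour described in the quoted
scenario, here is a small Python sketch (a toy model for this thread, not
kernel code; the MSS of 1428 and the sequence numbers are taken from the
diagram, and SACK block ordering in the actual TCP option is ignored)
that derives the cumulative ACK and SACK blocks a receiver should report
per RFC 2018 from the set of segments that have arrived:

```python
MSS = 1428
BASE = 34273

def seg_start(i):
    """First sequence number of 1-based segment i (matches the diagram)."""
    return BASE + (i - 1) * MSS

def expected_ack_and_sacks(received):
    """received: set of 1-based segment numbers that have arrived."""
    # Cumulative ACK: first byte of the lowest-numbered missing segment.
    i = 1
    while i in received:
        i += 1
    ack = seg_start(i)
    # SACK blocks: maximal runs of received segments above the ACK point.
    sacks = []
    j = i + 1
    top = max(received, default=0)
    while j <= top:
        if j in received:
            start = j
            while j in received:
                j += 1
            sacks.append((seg_start(start), seg_start(j)))
        else:
            j += 1
    return ack, sacks

# Before the retransmit, segments 1, 3, 5 and 7 have arrived:
print(expected_ack_and_sacks({1, 3, 5, 7}))
# -> (35701, [(37129, 38557), (39985, 41413), (42841, 44269)])

# Once the retransmitted segment 2 arrives, the cumulative ACK should
# jump past the already-received segment 3 in a single step:
print(expected_ack_and_sacks({1, 2, 3, 5, 7}))
# -> (38557, [(39985, 41413), (42841, 44269)])
```

That single jump to 38557 is what the quoted text calls the normal
response; the server stack in question instead emits the intermediate
ACK 37129 first.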
No, unfortunately the problem was only encountered by our customer in
a production environment (they tried to reproduce it in a test lab, but
with no luck). They are running backups to an NFS server and it happens
from time to time (on the order of hours, IIUC). So it would probably be
hard to let them try with a more recent kernel.

On the other hand, they reported that SLE11 clients (kernel 3.0) do not
run into this kind of problem. It was originally reported as a regression
on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4 kernel) and
the problem was described as "SLE12-SP2 is ignoring dupacks" (which seems
to be mostly caused by the switch to RACK).

It also seems that part of the problem is a specific packet loss pattern
where, at some point, many packets are lost in an "every second" pattern.
The customer finally started to investigate this and it seems to have
something to do with their bonding setup (they provided no details; my
guess is that packets are divided over two paths and one of them fails).

> packetdrill test would be relatively easy to write.

I'll try, but I have very little experience with writing packetdrill
scripts, so it will probably take some time.

> Regardless of this broken alien stack, we might be able to work around
> this faster than the vendor is able to fix and deploy a new stack.
>
> ( https://en.wikipedia.org/wiki/Robustness_principle )
> Be conservative in what you do, be liberal in what you accept from
> others...

I was thinking about this a bit. "Fixing" the acknowledgment number could
do the trick but it doesn't feel correct. We might use the fact that the
TSecr of both ACKs above matches the TSval of the retransmission which
triggered them, so that an RTT calculated from timestamps would be the
right one. So perhaps something like "prefer timestamp RTT if measured
RTT seems way too off". But I'm not sure whether that couldn't break
other use cases where a (high) measured RTT is actually correct, rather
than a (low) timestamp RTT.

Michal Kubecek
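To make the "prefer timestamp RTT if measured RTT seems way too off"
idea above concrete, here is a rough Python sketch. This is only an
illustration of the heuristic, not the Linux RTT estimator:
`choose_rtt()` is a hypothetical helper, and the factor of 4 is an
arbitrary assumption. The point is that a delayed ACK for an old segment
gives a huge measured sample, while the TSecr echoing the retransmission's
TSval gives a sample close to the real path RTT:

```python
def timestamp_rtt(now_ms, tsecr_ms):
    """RTT sample from TCP timestamps: elapsed time since the segment
    whose TSval is echoed back as TSecr was sent."""
    return now_ms - tsecr_ms

def choose_rtt(measured_ms, now_ms, tsecr_ms, factor=4):
    """Pick between a directly measured RTT sample (ACK arrival time
    minus original send time) and a timestamp-based sample."""
    ts_rtt = timestamp_rtt(now_ms, tsecr_ms)
    # If the measured sample exceeds the timestamp sample by more than
    # `factor`, assume the ACK was delayed (e.g. by a one-by-one-acking
    # receiver after an RTO) and prefer the timestamp RTT.
    if ts_rtt > 0 and measured_ms > factor * ts_rtt:
        return ts_rtt
    return measured_ms

# ACK for a segment sent long before the RTO: measured sample 900 ms,
# but TSecr shows the triggering retransmission left only 10 ms ago.
print(choose_rtt(measured_ms=900, now_ms=1000, tsecr_ms=990))  # -> 10
# Normal case: both samples agree, the measured one is kept.
print(choose_rtt(measured_ms=12, now_ms=1000, tsecr_ms=990))   # -> 12
```

If I read tcp_input.c correctly, the kernel already falls back to the
timestamp-based sample when no direct measurement is available; the
workaround would go further and distrust a direct measurement that the
timestamps contradict, which is exactly where the risk I mention above
(discarding a high measured RTT that is actually correct) comes in.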