Hello,

We have detected a problem with the IPVS module. Here is a quick summary of what triggers it:
- IPVS has a hardcoded TIME_WAIT timeout of 120s
- the kernel's TCP/IP layer has a hardcoded TIME_WAIT timeout of 60s
- the connection rescheduling mechanism in IPVS works by dropping the first received SYN and letting the client retransmit it after the (also hardcoded) RTO timeout, which in practice seems to be 1s

Here is a scenario that triggers the problem:

- we have some backend servers balanced by IPVS
- we have an external load balancer that balances requests from real clients to IPVS and does SNAT

Here is what happens in that scenario under high throughput:

- because of SNAT, the external load balancer appears as a single source IP for all requests forwarded to IPVS
- IPVS receives connections and forwards them to the internal servers, but once they have been served, they remain in the IPVS connection table in TIME_WAIT for 120s
- the external load balancer has a TIME_WAIT of 60s, so after that time (or earlier, if it reuses connections in TIME_WAIT) it recycles the same ephemeral ports to send requests to IPVS
- between those 60s (when the external LB starts reusing ports) and those 120s (while IPVS still has the connection in TIME_WAIT), the rescheduling mechanism in IPVS adds a 1s delay to connection establishment (due to the SYN drop and the RTO timeout on the LB)

This implies that when the external LB is under moderate load, approx. 250 req/s (calculated as [number of ports in net.ipv4.ip_local_port_range on the LB] divided by [TW timeout on the LB = 60s]), the rescheduling mechanism in IPVS adds a delay of 1s to the establishment of TCP connections to the internal servers.

This 1s delay seems to be caused by either:

- a mismatch between the hardcoded TW timeouts: IPVS = 120s, standard kernel TCP stack = 60s
- the rescheduling algorithm in IPVS, which forces the client (the LB) to wait an entire RTO before retransmitting the SYN packet

I'm not saying that IPVS is badly parameterized, nor that the rescheduling algorithm is badly designed.
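
For reference, the arithmetic behind the ~250 req/s estimate and the 60s-120s window can be sketched roughly as below. This is only an illustration under stated assumptions: the two timeouts are the ones quoted above, but the port range (15000 ports) is a hypothetical value chosen to make the numbers easy to follow, and the function names are invented for this sketch.

```python
# Timeouts as quoted in the description above.
IPVS_TW_TIMEOUT_S = 120  # IPVS TIME_WAIT timeout
LB_TW_TIMEOUT_S = 60     # kernel TCP TIME_WAIT timeout on the external LB


def port_reuse_rate(port_low, port_high, tw_timeout_s=LB_TW_TIMEOUT_S):
    """Connection rate (conn/s) above which the LB has to reuse an
    ephemeral port within tw_timeout_s seconds of its previous use."""
    ports = port_high - port_low + 1  # the range is inclusive
    return ports / tw_timeout_s


def hits_ipvs_time_wait(seconds_since_close, ipvs_tw_s=IPVS_TW_TIMEOUT_S):
    """True if a SYN reusing a port arrives while IPVS still keeps the
    previous connection entry in TIME_WAIT."""
    return 0 <= seconds_since_close < ipvs_tw_s


if __name__ == "__main__":
    # Hypothetical port range of 15000 ports (assumption, not a measured value).
    rate = port_reuse_rate(45000, 59999)
    print(f"ports start being reused at about {rate:.0f} conn/s")

    # Port reused 90s after close: the LB has already left TIME_WAIT (60s),
    # but IPVS has not (120s), so the first SYN would be dropped.
    print("reuse after 90s hits IPVS TIME_WAIT:", hits_ipvs_time_wait(90))
    print("reuse after 130s hits IPVS TIME_WAIT:", hits_ipvs_time_wait(130))
```

With a different net.ipv4.ip_local_port_range on the LB, the threshold rate changes proportionally; the 60s-120s window itself depends only on the two timeouts.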
You guys are awesome and have done really great work with IPVS. The question is then: what can we do to avoid that 1s delay when rescheduling connections?

If you need it, I can elaborate on any of the details above, and even provide a link to a GitHub issue (for the Docker project) with the details of how we ended up sending an email to this list.

Thanks in advance,
Toni

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-requ...@linuxvirtualserver.org
or go to http://lists.graemef.net/mailman/listinfo/lvs-users