On Fri, Jun 17, 2016 at 6:57 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Fri, 17 Jun 2016, Eric Dumazet wrote:
>> >
>> > To achieve this capacity with HZ=1000 without increasing the storage size
>> > by another level, we reduced the granularity of the first wheel level from
>> > 1ms to 4ms. According to our data, there is no user which relies on that
>> > 1ms granularity and 99% of those timers are canceled before expiry.
>> >
>>
>> Ah... This might be a problem for people using small TCP RTO timers in
>> datacenters (order of 5 ms)
>> (and small delay ack timers as well, in the order of 4 ms)
>>
>> TCP/pacing uses high resolution timer in sch_fq.c so no problem there.
>>
>> If we arm a timer for 5 ms, what are the exact consequences ?
>
> The worst case expiry time is 8ms on HZ=1000 as it is on HZ=250
>
>> I fear we might trigger lot more of spurious retransmits.
>>
>> Or maybe I should read the patch series. I'll take some time today.
>
> Maybe just throw it at such a workload and see what happens :)

Well, when network congestion happens in a cluster and hundreds of
millions of RTO timers fire, adding fuel to the fire, it is already a
nightmare ;)

To avoid increasing the probability of such events, we would need at
least a 4 ms difference between the RTO timer and the delack timer.
That means increasing both of them, and with them the P99 latencies of
RPC workloads.

Maybe a switch to hrtimer would be less risky, but I do not know yet
whether it is doable without a big performance penalty.
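
To make the arithmetic concrete, here is a rough model of the rounding being discussed. It is only a sketch and not the code from the patch series: it assumes a wheel level expires timers only on multiples of its granularity, so a requested expiry is rounded up to the next boundary. The function names are made up for illustration.

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not the kernel implementation):
 * a timer armed at now_ms for timeout_ms fires at the first multiple
 * of gran_ms that is >= now_ms + timeout_ms.
 */
unsigned long wheel_expiry_ms(unsigned long now_ms,
			      unsigned long timeout_ms,
			      unsigned long gran_ms)
{
	unsigned long expires;

	assert(gran_ms > 0);
	expires = now_ms + timeout_ms;
	return ((expires + gran_ms - 1) / gran_ms) * gran_ms;
}

/* Longest arm-to-fire delay over every arming phase within one granule. */
unsigned long worst_case_delay_ms(unsigned long timeout_ms,
				  unsigned long gran_ms)
{
	unsigned long worst = 0, phase, d;

	for (phase = 0; phase < gran_ms; phase++) {
		d = wheel_expiry_ms(phase, timeout_ms, gran_ms) - phase;
		if (d > worst)
			worst = d;
	}
	return worst;
}

/* Shortest arm-to-fire delay over every arming phase within one granule. */
unsigned long best_case_delay_ms(unsigned long timeout_ms,
				 unsigned long gran_ms)
{
	unsigned long best = wheel_expiry_ms(0, timeout_ms, gran_ms);
	unsigned long phase, d;

	for (phase = 1; phase < gran_ms; phase++) {
		d = wheel_expiry_ms(phase, timeout_ms, gran_ms) - phase;
		if (d < best)
			best = d;
	}
	return best;
}
```

Under this model, with a 4 ms granularity, worst_case_delay_ms(5, 4) gives the 8 ms worst case quoted above for a 5 ms RTO, while worst_case_delay_ms(4, 4) shows a 4 ms delack firing as late as 7 ms. Since best_case_delay_ms(5, 4) is 5 ms, the two windows overlap, which is why the timers would have to be moved apart by roughly one granule.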