On Thu, Oct 16, 2025 at 11:17 AM Ian Swett <[email protected]> wrote:
>
>
> On Tue, Oct 14, 2025 at 9:28 PM Kazuho Oku <[email protected]> wrote:
>
>>
>>
>> On Tue, Oct 14, 2025 at 11:45 PM Ian Swett <[email protected]> wrote:
>>
>>> Thanks for bringing this up, Kazuho. My back-of-the-envelope math also
>>> indicated that 1/3 was a better value than 1/2 when I looked into it a few
>>> years ago, but I never constructed a clean test to prove it with a real
>>> congestion controller. Unfortunately, our congestion control simulator is
>>> below our flow control layer.
>>>
>>> It probably makes sense to test this in the real-world and see if
>>> reducing it to 1/3 measurably reduces the number of blocked frames we
>>> receive on our servers.
>>>
>>
>> Makes perfect sense. In fact, that was how we noticed the problem —
>> someone asked us why the H3 traffic we were serving was slower than H2.
>> Looking at the stats, we saw that we were receiving blocked frames, and
>> ended up reading the client-side source code to identify the bug.
>>
>>
>>> There are use cases where auto-tuning is nice. Even for Chrome, there
>>> are cases where, if we had started with a smaller stream flow control
>>> window, we would have avoided some bugs in which a few streams consume
>>> the entire connection flow control window.
>>>
>>
>> Yeah, it can certainly be useful at times to block the sender’s progress
>> so that resources can be utilized elsewhere.
>>
>> That said, blocking Slow Start from making progress is a different matter
>> — especially after spending so much effort developing QUIC based on the
>> idea that reducing startup latency by one RTT is worth it.
>>
>>
>
> I completely agree. Is there a good heuristic for guessing whether the
> peer is still in slow start, particularly when one doesn't know what
> congestion controller they're using?
>
IIUC, the primary intent of auto-tuning is to avoid bufferbloat when the
receiving application is slow to read.
The intent makes perfect sense, but I’m under the impression that the "old"
approach - estimating the sender’s rate and trying to stay slightly ahead
of it - is showing its age.
As discussed, that approach is fragile and depends on specific sender
behavior. It also breaks down at the stream level: when multiplexing
streams for different applications over one connection, we'd like to limit
bufferbloat per-stream, yet streams themselves don't perform Slow Start.
So I wonder if we can estimate the maximum safe receive rate without
relying on pressure from the sender.
There might be two pragmatic paths:
1. Application-driven estimate (endpoints):
If the QUIC stack delivers data via a callback, measure how long that
callback takes and estimate the app's sustainable rate as:
throughput = bytes_provided / callback_duration
This directly ties the advertised window to the consumer's read rate.
2. Downstream-driven estimate (intermediaries):
For a proxy or a gateway, estimate downstream capacity as:
throughput = downstream_send_window_size / downstream_RTT
That gives an instantaneous bound based on what the next hop can actually
accept.
Both approaches might have caveats, but they produce immediate estimates
rather than tracking the sender as it ramps up.
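To make that concrete, here is a rough sketch in C of how a receiver
might compute the two estimates (the names and the one-or-two-RTT
buffering target are made up for illustration, not taken from any
existing stack):

#include <stdint.h>

/* 1. Application-driven (endpoints): the rate at which the application
 *    actually drains data from the delivery callback. */
static double app_read_rate(uint64_t bytes_provided, double callback_duration)
{
    if (callback_duration <= 0)
        return 0; /* no sample yet; keep the initial window */
    return (double)bytes_provided / callback_duration;
}

/* 2. Downstream-driven (intermediaries): what the next hop can accept
 *    per RTT gives an instantaneous bound. */
static double downstream_rate(uint64_t downstream_send_window, double downstream_rtt)
{
    if (downstream_rtt <= 0)
        return 0;
    return (double)downstream_send_window / downstream_rtt;
}

/* Either rate, multiplied by the delay we are willing to buffer for
 * (say one or two RTTs), yields a receive-window target that can then
 * be probed upward over time, as noted below. */
static uint64_t window_target(double rate, double target_delay)
{
    return (uint64_t)(rate * target_delay);
}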
With these approaches, there's no need to guess the sender's rate - whether
it's in Slow Start, Congestion Avoidance, or using something like Careful
Resume.
Note also that these approaches aren't necessarily in conflict with the
traditional one: a receiver could start with these estimations to establish
the baseline, then gradually increase the window to probe for higher
limits, just as the old method would.
WDYT?
>
> One could certainly use a heuristic such as assuming the peer might
> still be in slow start for the first N packets of the connection and
> then changing strategies, but it's clearly not perfect.
>
>
>>
>>> Thanks, Ian
>>>
>>> On Tue, Oct 14, 2025 at 8:49 AM Kazuho Oku <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Oct 14, 2025 at 7:36 PM Max Inden <[email protected]> wrote:
>>>>
>>>>> * Send MAX_DATA / MAX_STREAM_DATA no later than when 33% (i.e., 1/3)
>>>>> of the credit is consumed.
>>>>>
>>>>> Firefox will send MAX_STREAM_DATA after 25% of the credit has been
>>>>> consumed.
>>>>>
>>>>>
>>>>> https://github.com/mozilla/neqo/blob/791fd40fb7e9ee4599c07c11695d1849110e704b/neqo-transport/src/fc.rs#L30-L37
>>>>>
>>>>> * Instead of doubling (x2) the window size, increase it by a larger
>>>>> factor (e.g., x4).
>>>>>
>>>>> Firefox will increase the window by up to 4x the overshoot of the
>>>>> current BDP estimate.
>>>>>
>>>>
>>>> Good to know that Firefox uses these numbers. They look fine to me,
>>>> though depending on the size of the initial credit, Careful Resume might
>>>> get blocked.
>>>>
>>>>>
>>>>> https://github.com/mozilla/neqo/blob/791fd40fb7e9ee4599c07c11695d1849110e704b/neqo-transport/src/fc.rs#L402-L409
>>>>>
>>>>> * Disable auto tuning entirely (it's needed only for latency-sensitive
>>>>> applications).
>>>>>
>>>>> What would be a reasonable one-size-fits-all stream data window size,
>>>>> which at the same time doesn't expose the receiver to a memory exhaustion
>>>>> attack?
>>>>>
>>>>> Because it is difficult to estimate the sender's initial window and
>>>>> how quickly it ramps up - especially with algorithms like Careful Resume,
>>>>> which don't use Slow Start - my preference is to disable auto tuning by
>>>>> default.
>>>>>
>>>>> Wouldn't a high but reasonable start value + window auto-tuning be
>>>>> ideal?
>>>>>
>>>>
>>>>
>>>> Yeah, I think there’s often confusion between two distinct aspects:
>>>> a) the maximum buffer size that the receiver can allocate, and
>>>> b) how fast the sender might transmit.
>>>>
>>>> A is what receivers need to prevent memory-exhaustion attacks. It’s
>>>> purely a local policy: the limit might be 1 MB or 10 MB, but it’s unrelated
>>>> to B — that is, it doesn’t depend on how quickly the sender sends.
>>>>
>>>> For latency-sensitive applications that read slowly, it’s important to
>>>> cap the receive buffer at roughly read_speed × latency, because otherwise
>>>> bufferbloat increases latency. But again, that consideration is separate
>>>> from B.
>>>>
>>>> In my view, B mainly concerns minimizing the amount of memory allocated
>>>> inside the kernel. Kernel-space memory management is far more constrained
>>>> than in user space: allocations often have to be contiguous, and falling
>>>> back to swap is not an option. Note also that the TCP/IP stack is decades
>>>> old, from an era when memory was a much more precious resource than it is
>>>> today.
>>>>
>>>> In contrast, a QUIC stack running in user space can rely on virtual
>>>> memory, where fragmentation is rarely a real issue. When a user-space
>>>> buffer fills up, the program can simply call realloc() and append
>>>> data—possibly incurring operations such as virtual-memory remapping or
>>>> paging. There is no need to pre-reserve large contiguous chunks of memory.
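>>>>
>>>> As a purely illustrative sketch (hypothetical names, not taken from
>>>> any particular stack), such a buffer can grow on demand along these
>>>> lines:
>>>>
>>>> #include <stdint.h>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>>
>>>> struct recvbuf {
>>>>     uint8_t *base;
>>>>     size_t len, cap;
>>>> };
>>>>
>>>> /* Append incoming stream data, growing the buffer as needed.
>>>>  * realloc() may remap virtual memory, but no large contiguous
>>>>  * region has to be reserved up front. */
>>>> static int recvbuf_append(struct recvbuf *b, const uint8_t *data, size_t n)
>>>> {
>>>>     if (b->len + n > b->cap) {
>>>>         size_t newcap = b->cap ? b->cap * 2 : 16384;
>>>>         while (newcap < b->len + n)
>>>>             newcap *= 2;
>>>>         uint8_t *p = realloc(b->base, newcap);
>>>>         if (p == NULL)
>>>>             return -1;
>>>>         b->base = p;
>>>>         b->cap = newcap;
>>>>     }
>>>>     memcpy(b->base + b->len, data, n);
>>>>     b->len += n;
>>>>     return 0;
>>>> }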
>>>>
>>>> To summarize, there is far less need in QUIC, if any, to minimize the
>>>> receive window advertised to the peer, compared to what was necessary for
>>>> in-kernel TCP.
>>>>
>>>>
>>>>> On 14/10/2025 03.38, Kazuho Oku wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 29, 2025 at 4:28 PM Max Inden <[email protected]> wrote:
>>>>>
>>>>>> For what it is worth, also referencing previous discussion on this
>>>>>> list:
>>>>>>
>>>>>> "Why isn't QUIC growing?"
>>>>>>
>>>>>>
>>>>>> https://mailarchive.ietf.org/arch/msg/quic/RBhFFY3xcGRdBEdkYmTK2k926mQ/
>>>>>>
>>>>>
>>>>> Reading the old thread, I'm reminded that people often assume QUIC
>>>>> performs better than TCP. However, that is true only when the QUIC stack
>>>>> is
>>>>> implemented, configured, and deployed correctly.
>>>>>
>>>>> One bug I've seen in multiple stacks - one that significantly affects
>>>>> benchmark results - is the failure to auto-tune the receive window as
>>>>> aggressively as the sender's Slow Start allows.
>>>>>
>>>>> Based on my understanding, Google Quiche implements receive window
>>>>> auto-tuning as follows:
>>>>> * Send MAX_DATA / MAX_STREAM_DATA when 50% of the credit has been
>>>>> consumed.
>>>>> * Double the window size when these frames are sent frequently.
>>>>>
>>>>> Several other stacks have adopted this approach.
>>>>>
>>>>> The problem with this logic is that it's too conservative and causes
>>>>> the sender to become flow-control-blocked during Slow Start.
>>>>>
>>>>> Consider the following example:
>>>>> 1. The receiver advertises an initial Maximum Data of W.
>>>>> 2. After receiving 0.5W bytes, the receiver sends Maximum Data=2.5W
>>>>> (the window having doubled to 2W) along with ACKs up to W/2. The next
>>>>> Maximum Data will be sent once the receiver has received 1.5W bytes.
>>>>> 3. The receiver receives bytes up to W and ACKs them.
>>>>> 4. At this point, the sender's Slow Start permits transmitting up to
>>>>> 2W additional bytes, but the remaining advertised receive window is
>>>>> only 1.5W (2.5W minus the W already received). As a result, the
>>>>> connection becomes flow-control-blocked.
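>>>>>
>>>>> To make the arithmetic easy to replay, here is a rough sketch in C
>>>>> (not taken from any stack) that assumes the sender's initial cwnd
>>>>> equals W, Slow Start doubles cwnd once W is ACKed, and the receiver
>>>>> doubles its window on every update:
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>     const double W = 1.0;
>>>>>
>>>>>     /* 50% threshold: by the time W is ACKed, only the update at 0.5W
>>>>>      * has fired, so the limit is 0.5W + 2W = 2.5W. */
>>>>>     double limit_half = 0.5 * W + 2.0 * W;
>>>>>
>>>>>     /* 1/3 threshold: updates fire at W/3 (limit W/3 + 2W) and again
>>>>>      * at W (limit W + 4W = 5W). */
>>>>>     double limit_third = 1.0 * W + 4.0 * W;
>>>>>
>>>>>     /* After W is ACKed, cwnd has grown to 2W, so the sender wants to
>>>>>      * reach offset W + 2W = 3W. */
>>>>>     double sender_wants = 3.0 * W;
>>>>>
>>>>>     printf("1/2 threshold: limit %.1fW -> %s\n", limit_half,
>>>>>            sender_wants > limit_half ? "blocked" : "ok");
>>>>>     printf("1/3 threshold: limit %.1fW -> %s\n", limit_third,
>>>>>            sender_wants > limit_third ? "blocked" : "ok");
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> Under these assumptions the 1/2-threshold limit (2.5W) falls short of
>>>>> the 3W the sender wants, while a 1/3 threshold would already have
>>>>> raised the limit to 5W.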
>>>>>
>>>>> There are several ways to address this issue:
>>>>> * Send MAX_DATA / MAX_STREAM_DATA no later than when 33% (i.e., 1/3)
>>>>> of the credit is consumed.
>>>>> * Instead of doubling (x2) the window size, increase it by a larger
>>>>> factor (e.g., x4).
>>>>> * Disable auto tuning entirely (it's needed only for latency-sensitive
>>>>> applications).
>>>>>
>>>>> Because it is difficult to estimate the sender's initial window and
>>>>> how quickly it ramps up - especially with algorithms like Careful Resume,
>>>>> which don't use Slow Start - my preference is to disable auto tuning by
>>>>> default.
>>>>>
>>>>> In fact, this is also the choice made by Chromium, which is why it is
>>>>> not affected by this bug!
>>>>>
>>>>> For reference, Tatsuhiro addressed this issue in ngtcp2 in the
>>>>> following PRs:
>>>>> * https://github.com/ngtcp2/ngtcp2/pull/1396 - Tweak threshold for
>>>>> max_stream_data and max_data transmission
>>>>> * https://github.com/ngtcp2/ngtcp2/pull/1397 - Add note for window
>>>>> auto-tuning
>>>>> * https://github.com/ngtcp2/ngtcp2/pull/1398 - examples/client:
>>>>> Disable window auto-tuning by default
>>>>>
>>>>> However, I suspect the bug may still exist in other stacks.
>>>>>
>>>>> On 29/09/2025 05.38, Lars Eggert wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> pitch for a discussion at 124.
>>>>>>
>>>>>> https://radar.cloudflare.com/
>>>>>> <https://radar.cloudflare.com/adoption-and-usage?dateRange=52w> and
>>>>>> similar stats have had H3 around 30% for a few years now, with little
>>>>>> change since the first quick ramp up to that level.
>>>>>>
>>>>>> Topic: why is that and is there anything the WG or IETF can do to
>>>>>> change it (upwards, of course)?
>>>>>>
>>>>>> Thanks,
>>>>>> Lars
>>>>>> --
>>>>>> Sent from a mobile device; please excuse typos.
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Kazuho Oku
>>>>>
>>>>>
>>>>
>>>> --
>>>> Kazuho Oku
>>>>
>>>
>>
>> --
>> Kazuho Oku
>>
>
--
Kazuho Oku