On Wed, Mar 25, 2026 at 6:09, Martin Thomson <[email protected]> wrote:

> On Tue, Mar 24, 2026, at 21:06, Kazuho Oku wrote:
> > * it avoids requiring the decoder of every frame type to support trial
> > or incremental decoding to handle truncated input;
>
> As I said, while you might simplify, in doing so you introduce
> performance penalties that ultimately lead you back to having to handle
> incremental decoding. That this is an option is potentially valuable and
> an argument in favor, but I wouldn't weight it too much.
>
> However, frame decoding is such a small part of an implementation that I
> see this as less of a problem than you are making out.
>
> > * the overhead is identical when sending large data; and
>
> Not entirely correct, but it's very close. When the length spans more
> (the "record" length always includes at least a STREAM frame header) there
> is a chance that the varint needs more bytes. So the overhead will be ever
> so slightly higher.
>
> > * it naturally reduces the risk of blocking caused by frames crossing
> > TLS record boundaries.
>
> I don't think this is right. If you are blocking on the "record" being
> complete, any problem will be *worse*, not better, if the "record" spans
> more bytes.
Could you clarify why?

My point is not that buffering disappears with records. Rather, my point is that introducing records greatly reduces the likelihood of STREAM frames spanning multiple TLS records, and therefore that the resulting performance penalty is largely avoided even if a receiver does not implement trial or incremental decoding.

The step-by-step reasoning is as follows:

1. The default maximum size of a QMux record is 16382 bytes. Therefore, when a QMux sender has enough data, a reasonable implementation would generate QMux records of exactly 16384 bytes (16382 bytes of record payload prefixed with a 2-byte length field) and pass the records to the TLS stack, either one by one or as a batch.

2. When 16 KB or more of cleartext is given to a TLS write function, a natural behavior is to generate full-sized TLS records, followed by a short TLS record containing the remainder. Since each full-sized TLS record carries exactly 16384 bytes of payload, each TLS record would contain exactly one QMux record.

   For TLS stacks, this is natural behavior because (1) unless a TCP-cork-style API is used, they need to encrypt and send all the provided data without waiting for more, and (2) the underlying TCP stack does not typically expose exactly how much space remains in the send buffer, so the TLS stack cannot reliably emit shorter TLS records chosen to fit that buffer. OpenSSL behaves this way regardless of whether SSL_MODE_ENABLE_PARTIAL_WRITE is set. So does picotls.

3. This in turn means that whenever a QMux-over-TLS receiver decrypts a TLS record, a complete QMux record becomes available. As a result, additional latency due to QMux-layer buffering is avoided, even if the receiver does not implement trial or incremental processing.

Is there a part of that reasoning that you think does not hold?

> > Put differently, I think QMux records provide a structure that is
> > easier to implement efficiently across QMux stacks.
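The sender-side framing described above can be sketched as follows. This is a hypothetical illustration, not code from any QMux implementation: the 2-byte length prefix and the 16382-byte default maximum payload come from the reasoning above, while the big-endian prefix encoding and the function name are my assumptions.

```python
# Sketch: split application data into length-prefixed QMux records so that
# each full record is exactly 16384 bytes, the maximum TLS record plaintext
# size, and therefore maps one-to-one onto a full-sized TLS record.

MAX_RECORD_PAYLOAD = 16382  # default maximum QMux record payload


def frame_qmux_records(data: bytes) -> list[bytes]:
    records = []
    for off in range(0, len(data), MAX_RECORD_PAYLOAD):
        payload = data[off:off + MAX_RECORD_PAYLOAD]
        # 2-byte length prefix (big-endian assumed); full records total
        # 16382 + 2 = 16384 bytes.
        records.append(len(payload).to_bytes(2, "big") + payload)
    return records


records = frame_qmux_records(b"x" * 40000)
print([len(r) for r in records])  # → [16384, 16384, 7238]
```

A TLS stack handed these records back-to-back would emit two full-sized TLS records, each carrying exactly one complete QMux record, plus one short record for the tail.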
>
> As noted in my previous message, I think that this hides complexity and
> makes blocking worse. An efficient implementation will need to engage with
> incremental decoding regardless of any framing layer. It's a small thing,
> and it's not a blocker for me, but I am mildly opposed to the change.

-- 
Kazuho Oku
