Hi Ben,

On Mon, Feb 02, 2026 at 04:01:53PM +0000, Ben Kallus wrote:
> Hi Willy,
> 
> As always, your messages have given me a lot to think about.
> 
> > That's not what's done on anything else. You're not supposed to validate
> > the contents of user-agent if you're not using it. You're not validating
> > the syntax of "vary" if you're not using it, same for bytes-range, or
> > even "cookie". Each of these have their very specific syntax, and the
> > flexibility and the definition of the protocol precisely permits this:
> > you have to validate what you're consuming and can safely ignore the
> > rest that is properly bounded.
> 
> I think this is a bad requirement. A protocol in which you don't have
> to understand messages that you forward/consume is a protocol that's
> bound to have exploitable parsing problems.

No, quite the opposite, it's the principle of an extensible protocol.
If you had to know every specific header's syntax, no single agent
would be compliant because exceptions would pop up every time a new
header is added. Here the protocol is described hierarchically with
clear boundaries, and you go down to the level that concerns you,
avoiding the risk of being wrong on layers you don't master well
because they're out of your business. The deeper you go in analysis,
the harder it is to stay correct.

> Doubly so when differences in (not) parsing the ignored content may
> affect framing, like we see with chunk-ext.

That's precisely the opposite: here an agent tried to do its best in
a part that was out of its business and got trapped by it.

> > For headers, a header field is what
> > ends at a CRLF (if you don't support folding, otherwise you can detect
> > that next line starts with an LWS and accept to continue the header on
> > the next line).
> 
> This is true, but a compliant implementation also needs to verify that
> a header field value doesn't contain CR, LF, or NUL, even if this
> value's meaning is opaque to the implementation.

Exactly, and that's done, both in headers and chunk-ext, because they
are delimiters (except the NUL which is not one but is a well-known
accidental delimiter and is forbidden for this reason).

> I argue that
> validation of input according to the RFCs should happen at every link
> in the server chain. You might think this is unrealistic :)

This is not unrealistic, thanks to the initiative that sparked the
http-wg at the IETF roughly 20 years ago, which gathered most major
implementations around the table to analyze differences and figure out
how to proceed safely. That led to RFC 723x, which was the result of
about 7 years of hard work and is a jewel of engineering. During this
time, implementations had the opportunity to adjust their behavior to
better match the agreed-upon rules, and even to simply drop support for
old versions, so that work resulted in a huge cleanup of the internet
landscape. It was also the first time we purposely decided to smoothly
degrade the experience in ambiguous cases for the sake of safety. There
are cases (that I don't have in mind right now) where it's suggested to
just switch to close mode after certain requests or responses, just to
avoid smuggling. That way, we don't break non-compliant implementations
but protect them. It avoids the problem of a rollback after an upgrade
causing interoperability issues, which makes users stick to dangerous
versions. That's why nowadays most implementations are more willing
than they used to be to get rid of seriously bad patterns, because your
intermediary will no longer be pointed at as the wrong one when
degrading a seriously non-compliant peer.

> > I
> > can tell you for certain that mismatched quotes in header values are
> > not infrequent at all; you're just free to send what you want in these
> > headers provided that you don't send forbidden chars. Some even use
> > base85 which *does* contain the double-quote.
> 
> This is expected behavior. If a server rejected header values with
> mismatched quotes, that itself would be a bug, because mismatched
> quotes are allowed within header values according to the grammar.
> chunk-ext, on the other hand, does not allow mismatched quotes, so we
> should probably check it.

Yes but it's optional and well-delimited:
  - starts at the semi-colon
  - ends at the CRLF

So there's no ambiguity in how to parse it: you're free to ignore what
starts at the semi-colon till the CRLF if you're not interested in it,
provided that characters are correct (i.e. no CTL there). That's the
difference between the grammar and the character set.
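To make that distinction concrete, here is a hypothetical sketch (in
Python; the names are mine, not from any real implementation) of
skipping a chunk-ext while enforcing only the character set, not the
grammar:

```python
# Hypothetical sketch: skip a chunk-ext without parsing its grammar,
# rejecting only forbidden control characters (the character-set check).
def skip_chunk_ext(ext: str) -> None:
    """'ext' is everything between the chunk-size and the CRLF."""
    if not ext:
        return  # chunk-ext is optional
    if not ext.lstrip(" \t").startswith(";"):
        raise ValueError("chunk-ext must start at the semi-colon")
    for ch in ext:
        # CTLs are forbidden; HTAB is tolerated as BWS. CRLF was already
        # consumed as the outer delimiter, so none should remain here.
        if (ord(ch) < 0x20 and ch != "\t") or ord(ch) == 0x7F:
            raise ValueError("forbidden control character in chunk-ext")
    # Quotes, backslashes, etc. are deliberately left uninspected.
```

Note that a mismatched quote passes through untouched: that is exactly
what ignoring the grammar while checking the character set means.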

> > And we do check for CR & NUL not appearing in headers (and LF for H2/H3)
> > for the sake of not risking to improperly transcode H2/3 to H1 (and the
> > spec was improved on this front for the same reason). But you can't do
> > much more on headers you simply don't know.
> 
> Yes; this is totally in agreement with what I'm saying. The goal is to
> apply all *generic* validation possible, while leaving
> application-specific validation to the application. I'm just saying we
> should do the same for chunk-ext.

We already did the same for chunk-ext: no CR nor LF was allowed there
already. However we didn't check for NUL (nor CTLs in general).

That said, since you found one example of a problematic implementation,
in order not to have this discussion every time you find a new one, and
since the cost was low, we decided to implement a deeper check. But
asking implementations to implement support for quotes and backslashes
is dangerous (and my first attempt was wrong, which definitely got on
my nerves, because risking to introduce security issues while trying to
protect against those in other components is quite irritating).

> > "Well-formed" is very hard to determine because in some cases it will
> > be improperly formed based on its own definition. Bytes-range is a good
> > example, there have been issues in the past with overlapping ranges that
> > were only exploiting the difficulty to convert it to a unified range
> > while parsing it on the fly, and such issues didn't require any CR nor
> > NUL character or even less qd-text. So no, we do not *validate* headers
> > that are not consumed because noone knows them nor can decide whether
> > they're correct or not, we only validate headers we use (e.g. content-
> > length etc), and that already requires a complex and expensive logic to
> > get these right.
> 
> That's correct. If that definition comes from the RFCs, then a
> compliant server should validate against that definition. For range
> headers, that means syntactic validation according to the grammar, and
> optional semantic validation (see RFC 9110 section 14.2).

Each precise definition matters to whoever consumes the element. We do
not consume chunk extensions, like essentially any other HTTP agent.
For the Range header it's the same: we don't use it so we don't care
what's in it. That is the principle of the standards as defined in the
RFCs: they explain how to validate the protocol elements that you are
*using*. This is why the protocol is defined in a hierarchical way: you
have the message, composed of a header block, a body, optional
trailers, etc. The spec even makes special rules for checking certain
elements even if you don't use them, so as not to blindly forward them
and end up in trouble between two other incompatible implementations.
This approach is critical to the security of the whole ecosystem.

> If the header is instead totally unrecognized (i.e., not one from the
> RFCs) then of course it can't be validated beyond the basic CR, LF,
> NUL checks.

CR and NUL are indeed among the characters that are forbidden in *any*
header, regardless of your intent to process it or not, often because
if you forward them, there's a risk that the next agent stops at a
different point (remember that originally HTTP parsers were shell
scripts calling "read" in loops). There was also the case of agents
running under DOS/Windows, where using generic line-handling functions
could result in your line being split at a lone CR while a UNIX system
would split it on the LF. That's why it's mandated to check for these
characters anywhere.
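As a hypothetical sketch (names are mine), the generic, header-agnostic
check applied to every header value before forwarding, whether we
consume it or not, is essentially this:

```python
# Bytes forbidden in any header value: NUL, CR, LF. Forwarding any of
# them risks the next agent splitting the line at a different point.
FORBIDDEN = (0x00, 0x0D, 0x0A)

def value_is_forwardable(value: bytes) -> bool:
    """True if 'value' contains none of the forbidden delimiters."""
    return all(b not in FORBIDDEN for b in value)
```

Everything else, including quotes and other application-specific
syntax, is left to whoever actually consumes the header.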

> > Except that in this specific case it ignores the outer delimiter (CRLF).
> 
> Where does it say that "ignoring" a grammatical element means skipping
> to the delimiter after the element?

It's not "ignoring", it's delimiting. CRLF marks *the end* of the chunk
extension. It's in the protocol spec:

  chunk          = chunk-size [ chunk-ext ] CRLF
                   chunk-data CRLF

  chunk-ext      = *( BWS ";" BWS chunk-ext-name
                      [ BWS "=" BWS chunk-ext-val ] )

So in order to process a chunk, you have to:

 - read a chunk size
 - optionally read a chunk-ext
 - read a CRLF
 - read a chunk-data
 - read a CRLF

You don't need to know more. The spec also says:

     A recipient MUST ignore unrecognized chunk extensions.

Thus even if you know no chunk extension, the rules above give you
everything you need to properly process your chunk while ignoring any
extension.
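For illustration only, the five steps can be sketched like this (a
simplified Python reader under the assumption that the whole chunk is
already buffered; the function name is mine, not from any real stack):

```python
import io

def read_chunk(stream: io.BytesIO) -> bytes:
    # Steps 1-3: chunk-size, optional chunk-ext and the CRLF are one line.
    line = stream.readline()
    if not line.endswith(b"\r\n"):
        raise ValueError("chunk header not CRLF-terminated")
    size_part, _, _ext = line[:-2].partition(b";")  # chunk-ext ignored
    size = int(size_part.strip(), 16)               # chunk-size is hex
    data = stream.read(size)                        # step 4: chunk-data
    if len(data) != size or stream.read(2) != b"\r\n":  # step 5: CRLF
        raise ValueError("truncated chunk or missing CRLF after data")
    return data
```

The extension, if any, is split off at the semi-colon and dropped
without further inspection, which is all the grammar requires of an
agent that doesn't recognize it (the character-set check discussed
earlier would slot in before dropping it).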

But as I said, we've added this specific handling anyway, as our
regular contribution to improving the whole ecosystem's robustness.

BTW, for adrenaline, you should probably have a look at HTTP stacks used
in IoT, maybe some of the stuff you use yourself or some of your relatives
do. In short, they're horrible, with strstr() or blah.findoccurrence("xxx")
or whatever all over the place, sometimes just counting bytes and skipping
them because "we know that 0 CR is necessarily followed by LFCRLF hence 5
bytes total" etc. I started to significantly clean one a long time ago and
gave up when noticing a chunk size being parsed with sscanf() and
stuff like this. Seeing this type of thing sometimes reminds me that
what we're doing is not wasted time ;-)

Willy

