Hi Willy,

As always, your messages have given me a lot to think about.

> That's not what's done on anything else. You're not supposed to validate
> the contents of user-agent if you're not using it. You're not validating
> the syntax of "vary" if you're not using it, same for bytes-range, or
> even "cookie". Each of these have their very specific syntax, and the
> flexibility and the definition of the protocol precisely permits this:
> you have to validate what you're consuming and can safely ignore the
> rest that is properly bounded.

I think this is a bad requirement. A protocol in which you don't have to understand messages that you forward/consume is a protocol that's bound to have exploitable parsing problems. Doubly so when differences in (not) parsing the ignored content may affect framing, like we see with chunk-ext.

> For headers, a header field is what
> ends at a CRLF (if you don't support folding, otherwise you can detect
> that next line starts with an LWS and accept to continue the header on
> the next line).

This is true, but a compliant implementation also needs to verify that a header field value doesn't contain CR, LF, or NUL, even if this value's meaning is opaque to the implementation. I argue that validation of input according to the RFCs should happen at every link in the server chain. You might think this is unrealistic :)

> I
> can tell you for certain that mismatched quotes in header values are
> not infrequent at all; you're just free to send what you want in these
> headers provided that you don't send forbidden chars. Some even use
> base85 which *does* contain the double-quote.

This is expected behavior. If a server rejected header values with mismatched quotes, that itself would be a bug, because mismatched quotes are allowed within header values according to the grammar. chunk-ext, on the other hand, does not allow mismatched quotes, so we should probably check it.

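To make the chunk-ext point concrete, here is a rough sketch of the kind of generic check I have in mind, written in Python against my reading of the chunk-ext grammar in RFC 9112 section 7.1.1. The name is_valid_chunk_ext and the use of a regex are my own choices for illustration, not taken from any existing implementation. It expects only the bytes between chunk-size and the terminating CRLF.

```python
import re

# Building blocks, following my reading of RFC 9110 section 5.6
# (token, quoted-string) and RFC 9112 section 7.1.1 (chunk-ext).
TOKEN = rb"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
QDTEXT = rb"[\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]"   # no DQUOTE, no backslash, no CTLs
QUOTED_PAIR = rb"\\[\t \x21-\x7E\x80-\xFF]"         # backslash + HTAB / SP / VCHAR / obs-text
QUOTED_STRING = rb'"(?:' + QDTEXT + rb"|" + QUOTED_PAIR + rb')*"'
BWS = rb"[ \t]*"

# chunk-ext = *( BWS ";" BWS chunk-ext-name [ BWS "=" BWS chunk-ext-val ] )
CHUNK_EXT = re.compile(
    rb"(?:" + BWS + rb";" + BWS + TOKEN
    + rb"(?:" + BWS + rb"=" + BWS + rb"(?:" + TOKEN + rb"|" + QUOTED_STRING + rb"))?"
    + rb")*"
)


def is_valid_chunk_ext(ext: bytes) -> bool:
    """Return True if `ext` (hypothetical helper) matches the chunk-ext grammar.

    `ext` is everything after chunk-size and before the CRLF that ends
    the chunk line; an empty value means "no extensions" and is valid.
    """
    return CHUNK_EXT.fullmatch(ext) is not None


if __name__ == "__main__":
    for sample in (b";name=value", b';name="unterminated', b';name="\\\r"'):
        print(sample, is_valid_chunk_ext(sample))
```

Note that this rejects an unterminated quoted-string and a backslash followed by CR, but it says nothing about header field values, where unbalanced double-quotes remain perfectly legal.
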
> And we do check for CR & NUL not appearing in headers (and LF for H2/H3)
> for the sake of not risking to improperly transcode H2/3 to H1 (and the
> spec was improved on this front for the same reason). But you can't do
> much more on headers you simply don't know.

Yes; this is totally in agreement with what I'm saying. The goal is to apply all *generic* validation possible, while leaving application-specific validation to the application. I'm just saying we should do the same for chunk-ext.

> "Well-formed" is very hard to determine because in some cases it will
> be improperly formed based on its own definition. Bytes-range is a good
> example, there have been issues in the past with overlapping ranges that
> were only exploiting the difficulty to convert it to a unified range
> while parsing it on the fly, and such issues didn't require any CR nor
> NUL character or even less qd-text. So no, we do not *validate* headers
> that are not consumed because noone knows them nor can decide whether
> they're correct or not, we only validate headers we use (e.g. content-
> length etc), and that already requires a complex and expensive logic to
> get these right.

That's correct. If that definition comes from the RFCs, then a compliant server should validate against that definition. For range headers, that means syntactic validation according to the grammar, and optional semantic validation (see RFC 9110 section 14.2). If the header is instead totally unrecognized (i.e., not one from the RFCs) then of course it can't be validated beyond the basic CR, LF, NUL checks.

> Except that in this specific case it ignores the outer delimiter (CRLF).

Where does it say that "ignoring" a grammatical element means skipping to the delimiter after the element?

> I've done my best to also deal with CTLs on the line (including NUL
> but not HT) and check for mismatched quotes and backslashes. While
> refining it, I figured that the code starts to be complex, and that
> I wouldn't be surprised if you found more non-compliant variants
> that choke on NUL inside the extension (even inside quotes), or a
> sequence such as backslash CR CR LF which some might interprete as a
> backslashed CR followed by CRLF, and others as a double CR followed
> by LF. Similarly we could imagine that <quote> backslash CR <quote>
> CR LF could be mistaken for a double CRLF by implementations only
> checking for the CR and skipping the next char assuming an LF while
> here it would only be a quote. I'm just suggesting some ideas, as
> I know that you love playing with that :-)

Thanks for the suggestions! This sounds like a good direction.
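
For concreteness, here is a small Python sketch of how I would write those two sequences down as test inputs. Both line-end finders below are hypothetical simplifications that I made up to illustrate the disagreement; they are not modeled on any particular server. The samples are only the tail of a chunk line, not a full chunked body.

```python
def end_by_crlf(buf: bytes) -> int:
    """Line ends right after the first literal CR LF pair (-1 if none)."""
    i = buf.find(b"\r\n")
    return -1 if i < 0 else i + 2


def end_by_cr_skip(buf: bytes) -> int:
    """Hypothetical shortcut: stop at the first CR and skip the next byte,
    assuming (without checking) that it is an LF."""
    i = buf.find(b"\r")
    return -1 if i < 0 else i + 2


# The two byte sequences suggested in the quoted message.
SAMPLES = {
    "backslash CR CR LF": b"\\\r\r\n",
    "quote backslash CR quote CR LF": b'"\\\r"\r\n',
}


def disagreements(samples=SAMPLES):
    """Map each sample on which the two finders disagree to the bytes that
    each finder would leave over for the next parsing step."""
    out = {}
    for name, raw in samples.items():
        a, b = end_by_crlf(raw), end_by_cr_skip(raw)
        if a != b:
            out[name] = (raw[a:], raw[b:])
    return out


if __name__ == "__main__":
    for name, leftover in disagreements().items():
        print(name, leftover)
```

In the second sample the CR-skipping finder is left with a bare CRLF, which is exactly the "double CRLF" misreading described above. A parser that validates the extension against the grammar would reject both inputs before any such disagreement could matter.
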

