Re: request parameters mishandle utf-8 encoding

André Warnier Fri, 01 Aug 2008 14:58:03 -0700

Christopher Schultz wrote:
[...]

Here is the definitive reference :
http://www.faqs.org/rfcs/rfc2396.html
and see 1.5. URI Transcribability and following if you are courageous.

And the HTTP 1.1 RFC 2616 refers to the above RFC as regards URL encoding.

The point is that the URL contained in the HTTP request line (the first line) cannot be considered to be in any particular encoding, unless the client and server somehow agree on a convention in advance. All the specs say is that only certain ranges of bytes are allowed "as is" in URLs, that the rest should be escaped, and how they should be escaped.

To say this in lay language: you can decide to write a URL in pretty much any encoding of any character set you want, but then, once you have your encoded URL, you should scan it byte by byte, and any byte that is not in the accepted "as is" range should be escaped as per the spec. The accepted range is, generally speaking, the byte values that correspond to the printable characters of US-ASCII, minus some "excluded" or reserved characters like #, <, >, / etc. Any byte above 0x7F must always be escaped.

For example, if your choice of encoding was such that, after encoding, at position 30 of your URL string there was a byte with a hex value 0x20 (which in iso-8859-1 is a space), then it should be replaced by the sequence %20, or by a "+" in a form-encoded query string. Similarly, if after the original encoding there happened to be a byte at position 40 with a hex value of 0x0D (CR, a control character), it should be replaced by the sequence %0D. And so on.
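
To make the byte-by-byte rule concrete, here is a minimal Python sketch. The function name and the SAFE set are my own assumptions for illustration: SAFE is a deliberately small subset of what the RFC allows unescaped, and the sketch always escapes a space as %20 (the "+" form belongs to form-encoded query strings).

```python
# Bytes this sketch leaves unescaped: ASCII letters, digits and a few marks.
# Assumption: a simplified set; the real rules vary per URL component.
SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-_.~"
)


def escape_url_text(text: str, charset: str) -> str:
    """Encode text in the chosen charset, then escape every unsafe byte as %XX."""
    raw = text.encode(charset)
    out = []
    for byte in raw:
        if byte in SAFE:
            out.append(chr(byte))          # allowed "as is"
        else:
            out.append("%{:02X}".format(byte))  # everything else is escaped
    return "".join(out)


latin1_form = escape_url_text("caf\u00e9", "iso-8859-1")  # "caf%E9"
utf8_form = escape_url_text("caf\u00e9", "utf-8")         # "caf%C3%A9"
cr_form = escape_url_text("a\rb", "iso-8859-1")           # "a%0Db"
```

The same word gives two different escaped URLs depending on the encoding chosen first, and nothing in the result says which one was used.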


Now, whether the server will "understand" your URL is another matter.

The receiving HTTP server should first of all decode the received URL in the same way, before any further decoding is done. Thus, from left to right, any "+" byte should be replaced by a byte 0x20, any sequence "%0D" should be replaced by the single byte with hex value 0x0D, etc.
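
The server-side step can be sketched like this, using Python's standard urllib.parse.unquote_to_bytes. The wrapper name is hypothetical, and treating "+" as a space is only correct for form-encoded query strings. Note that the result is raw bytes: no character set has been assumed yet.

```python
from urllib.parse import unquote_to_bytes


def unescape_query_bytes(escaped: str) -> bytes:
    """Undo URL escaping and return raw bytes, without guessing a charset."""
    # Swap "+" for a space before resolving %XX sequences, so that a
    # literal plus sign sent as %2B survives the round trip.
    return unquote_to_bytes(escaped.replace("+", " "))


decoded = unescape_query_bytes("a+b%0Dc")   # b"a b\rc"
```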

Then, by default, it is the convention that in the absence of any other information or convention, the resulting string should be considered as being in the iso-8859-1 (latin-1) alphabet.

However, if the client and server have somehow made a convention that they would exchange URLs containing Unicode characters, encoded as UTF-8, that's fine.
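
A small Python illustration of why the convention matters: the same unescaped bytes give different text depending on which character set the server assumes. The sample bytes are my own example.

```python
raw = b"caf\xc3\xa9"  # bytes left after unescaping "caf%C3%A9"

# Default assumption: iso-8859-1, where every byte is one character.
as_latin1 = raw.decode("iso-8859-1")

# Agreed convention: the client encoded the text as UTF-8.
as_utf8 = raw.decode("utf-8")
```

Decoding as iso-8859-1 never fails, so a wrong guess goes unnoticed and simply yields two junk characters in place of the "é".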

After the HTTP request line come any number of HTTP headers. As far as I remember, these should conform to the rules for MIME headers, which may well specify that they should be limited to ASCII, I am too lazy to check.


Then there is a blank line, which may be followed by the request content.

For that one, the situation is totally different, because a preceding HTTP header should specify the content-type, and if it is text, the character set and encoding used.
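
For the body, the character set can be read from the Content-Type header. A minimal sketch using Python's standard email.message.Message to parse the header value; the two function names are hypothetical, and the iso-8859-1 fallback is the HTTP 1.1 default for text types when no charset parameter is given.

```python
from email.message import Message


def body_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a text body using the charset its own header announces."""
    return body.decode(body_charset(content_type))


text = decode_body(b"caf\xc3\xa9", "text/plain; charset=UTF-8")
```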

By using the option in Tomcat that specifies "consider the request URL as being in the same encoding as the request body" (the Connector attribute useBodyEncodingForURI), you are making the big assumption that you know the client, and that you know that it will send requests that way. Between a client and a server that "don't know each other", it is very unsafe to make that assumption. Specifying this parameter in Tomcat is not going to magically make your client respect that convention.
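
A sketch of what happens when that assumption is wrong: the client escaped iso-8859-1 bytes, while the server insists on UTF-8. This is plain Python, not Tomcat's actual behaviour; the function name is hypothetical, and a real server may substitute replacement characters instead of raising.

```python
from urllib.parse import unquote_to_bytes


def decode_param(escaped: str, assumed_charset: str) -> str:
    """Unescape a query value, then decode it strictly in the assumed charset."""
    raw = unquote_to_bytes(escaped.replace("+", " "))
    return raw.decode(assumed_charset)  # raises on invalid byte sequences


client_sent = "caf%E9"  # the client encoded "café" as iso-8859-1

try:
    value = decode_param(client_sent, "utf-8")
except UnicodeDecodeError:
    value = None  # a lone 0xE9 byte is not valid UTF-8
```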


It's a pity, but that's the way it is with HTTP 1.1.

The people who designed the protocol and wrote the specs did a great job, but did not include any unambiguous way to specify, in the URL itself, in which character set or encoding of ditto it was written, if it is not the default latin-1.

In e-mail, by contrast, there exists a way (MIME encoded-words, RFC 2047) to specify the encoding of a header value (e.g. the "Subject" header), at the beginning of the header value itself.
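
For comparison, this is what such a self-describing mail header looks like, assuming the mechanism meant here is the MIME encoded-word syntax of RFC 2047. The sample header value is my own; decode_header and make_header are from Python's standard email.header module.

```python
from email.header import decode_header, make_header

# The charset (iso-8859-1) and the transfer encoding (q) are announced
# at the start of the header value itself.
raw_subject = "=?iso-8859-1?q?p=F6stal?="

parts = decode_header(raw_subject)   # list of (bytes, charset) pairs
subject = str(make_header(parts))    # decoded text
```

A URL has no equivalent prefix, which is exactly the gap described above.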


André
