I'm trying to find a spec-mandated treatment of non-ASCII characters in
HTTP URIs.
RFC 3986, "Uniform Resource Identifier (URI): Generic Syntax" (January
2005), says:
When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the
data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-encoded.
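To make the RFC 3986 rule concrete: the character is first encoded as UTF-8 octets, then each octet outside the unreserved set is percent-encoded. A minimal sketch in Python (urllib.parse.quote happens to default to UTF-8, matching the recommendation):

```python
from urllib.parse import quote

# "é" is U+00E9; its UTF-8 encoding is the two octets 0xC3 0xA9,
# so the RFC 3986 recommendation produces "%C3%A9".
encoded = quote("é")
print(encoded)  # %C3%A9
```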
However, the latest HTTP spec is RFC 2616 (June 1999), and it
references the older RFC 2396, "Uniform Resource Identifiers (URI):
Generic Syntax and Semantics", which says:
For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences are
expected to provide some way of identifying the charset used, if there
might be more than one [RFC2277]. However, there is currently no
provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single charset,
define a default charset, or provide a way to indicate the charset used.
So no generic treatment was mandated in 1999, and RFC 2616 mandates no
scheme-specific treatment either. I guess the treatment is left up to
the implementation. In fact, I've found a few URL encoder/decoder
forms on the web that give conflicting results.
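Those conflicting results are easy to reproduce: the same character percent-encoded under two different charsets yields two different octet sequences, and a decoder that guesses the wrong charset garbles the text. A sketch (Python, with the charset made explicit):

```python
from urllib.parse import quote, unquote

# Same character, two charsets -> two different percent-encodings.
utf8 = quote("é", encoding="utf-8")      # %C3%A9 (RFC 3986 recommendation)
latin1 = quote("é", encoding="latin-1")  # %E9 (seen in older encoders)
print(utf8, latin1)

# A decoder that assumes the wrong charset produces mojibake:
print(unquote("%C3%A9", encoding="latin-1"))  # Ã©
```

This is exactly the ambiguity RFC 2396 acknowledges: without a way to signal the charset in the URI itself, encoder and decoder can only agree by convention.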
Any hints?
Thanks,
Ron