I'm trying to find a spec-mandated treatment of non-ASCII characters in HTTP URIs.

RFC 3986 "Uniform Resource Identifier (URI): Generic Syntax" (January 2005) says

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.
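In practice, that rule means encoding the text as UTF-8 first and then percent-encoding every resulting octet outside the unreserved set. A quick sketch of this in Python (for what it's worth, the standard library's urllib.parse.quote already behaves this way by default):

```python
from urllib.parse import quote

def rfc3986_encode(text):
    """Encode text per RFC 3986: UTF-8 first, then percent-encode
    every octet that is not in the unreserved set."""
    unreserved = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  b"abcdefghijklmnopqrstuvwxyz"
                  b"0123456789-._~")
    octets = text.encode("utf-8")
    return "".join(chr(o) if o in unreserved else "%%%02X" % o
                   for o in octets)

print(rfc3986_encode("café"))  # caf%C3%A9
print(quote("café"))           # caf%C3%A9 -- stdlib agrees
```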

However, the latest HTTP spec is RFC 2616 (June 1999), which references the older RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax", which says

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.


So there was no generic treatment mandated in 1999, and there's no scheme-specific treatment mandated in RFC 2616. I guess the treatment is left up to the implementation. In fact, I've found a few URL decoder/encoder forms on the web that give conflicting results.
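Those conflicting results are easy to reproduce, since without a mandated charset a decoder has to guess: %E9 is a perfectly good Latin-1 'é' but an invalid UTF-8 byte sequence. A small illustration in Python (the input string here is just a made-up example, not from any particular form):

```python
from urllib.parse import unquote

s = "caf%E9"  # "café" percent-encoded under Latin-1 (ISO-8859-1)

# A decoder assuming Latin-1 recovers the original text...
print(unquote(s, encoding="latin-1"))                  # café
# ...while one assuming UTF-8 sees an invalid sequence.
print(unquote(s, encoding="utf-8", errors="replace"))  # caf\ufffd
```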

Any hints?

Thanks,
Ron
