I'm trying to find a spec-mandated treatment of non-ASCII characters in
HTTP URIs.
RFC 3986, "Uniform Resource Identifier (URI): Generic Syntax" (January
2005), says:
When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the
data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-encoded.
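To make the RFC 3986 rule concrete: the character is first encoded as UTF-8 octets, then each octet outside the unreserved set is percent-encoded. A minimal sketch in Python (urllib.parse.quote happens to default to UTF-8, matching the recommendation):

```python
from urllib.parse import quote

# "é" is U+00E9; its UTF-8 encoding is the two octets 0xC3 0xA9,
# so the RFC 3986 recommendation produces "%C3%A9".
encoded = quote("é")
print(encoded)  # %C3%A9
```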
However, the latest HTTP spec is RFC 2616 (June 1999), and it
references the older RFC 2396, "Uniform Resource Identifiers (URI):
Generic Syntax and Semantics", which says:
For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences are
expected to provide some way of identifying the charset used, if there
might be more than one [RFC2277]. However, there is currently no
provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single charset,
define a default charset, or provide a way to indicate the charset used.
So no generic treatment was mandated in 1999, and RFC 2616 mandates no
scheme-specific treatment either. I guess the treatment is left up to
the implementation. In fact, I've found a few URL encoder/decoder
forms on the web that give conflicting results.
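Those conflicting results are easy to reproduce: the same character percent-encoded under two different charsets yields two different octet sequences, and a decoder that guesses the wrong charset garbles the text. A sketch (Python, with the charset made explicit):

```python
from urllib.parse import quote, unquote

# Same character, two charsets -> two different percent-encodings.
utf8 = quote("é", encoding="utf-8")      # %C3%A9 (RFC 3986 recommendation)
latin1 = quote("é", encoding="latin-1")  # %E9 (seen in older encoders)
print(utf8, latin1)

# A decoder that assumes the wrong charset produces mojibake:
print(unquote("%C3%A9", encoding="latin-1"))  # Ã©
```

This is exactly the ambiguity RFC 2396 acknowledges: without a way to signal the charset in the URI itself, encoder and decoder can only agree by convention.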
Any hints?
Thanks,
Ron