On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <[email protected]> wrote:
> As far as I know, revised RFC permits UTF-8 characters in the URL without
> encoding. Am I wrong here?
>
The latest URI RFC is 3986. The relevant description in prose is:
Local names, such as file system names, are stored with a local
character encoding. URI producing applications (e.g., origin
servers) will typically use the local encoding as the basis for
producing meaningful names. The URI producer will transform the
local encoding to one that is suitable for a public interface and
then transform the public interface encoding into the restricted set
of URI characters (reserved, unreserved, and percent-encodings).
Those characters are, in turn, encoded as octets to be used as a
reference within a data format (e.g., a document charset), and such
data formats are often subsequently encoded for transmission over
Internet protocols.
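Concretely, that transformation looks something like this (a Python
sketch rather than CHICKEN code, and assuming the local name is
stored as UTF-8):

  from urllib.parse import quote, unquote

  # A "local name" containing non-ASCII characters (example only).
  local_name = "한글.html"

  # Transform into the restricted set of URI characters: the UTF-8
  # octets of each non-ASCII character become percent-encodings.
  segment = quote(local_name, safe="")
  print(segment)                     # %ED%95%9C%EA%B8%80.html

  # The consumer reverses this; the URI itself doesn't record which
  # charset the octets are in, so UTF-8 is assumed here.
  assert unquote(segment) == local_name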
The relevant parts of the BNF are:
pct-encoded = "%" HEXDIG HEXDIG
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Thus you can't use raw non-ASCII bytes in a URI: they must be
percent-encoded, and how the resulting octets are interpreted is up
to the origin (overwhelmingly as UTF-8 these days).
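To see how that follows from the grammar, here's a rough pchar check
I put together (a Python sketch, not anything from the RFC): the raw
Hangul string fails it, while its percent-encoded UTF-8 form passes.

  import re

  # pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
  UNRESERVED = r"A-Za-z0-9\-._~"
  SUB_DELIMS = r"!$&'()*+,;="
  PCHAR = re.compile(
      rf"^(?:[{UNRESERVED}{SUB_DELIMS}:@]|%[0-9A-Fa-f]{{2}})*$")

  print(bool(PCHAR.match("%ED%95%9C%EA%B8%80")))  # True
  print(bool(PCHAR.match("한글")))                 # False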
> Even Solr (the search engine) permits them.
>
It would of course be possible for any tool or webserver to
accept URIs with non-ASCII bytes, but I don't know of any
browsers which would _send_ such a request, because in
general it would be rejected.
I tried searching for non-ASCII text on whitehouse.gov (which uses
Solr), and indeed it generated a percent-encoded query. My browser
(Chrome) rendered the percent escapes as UTF-8 for me, though.
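The query string the browser built is just ordinary form encoding of
the UTF-8 octets; in Python, for example (the Korean term here is a
made-up search, not the one I actually typed):

  from urllib.parse import urlencode

  # The browser sends the UTF-8 octets of the query percent-encoded,
  # never the raw bytes.
  print(urlencode({"q": "백악관"}))   # q=%EB%B0%B1%EC%95%85%EA%B4%80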
There's also Punycode, which can be used to represent Unicode
domain names (which otherwise don't even allow percent escapes).
In some cases certain browsers will render this for you (generally
when the script of the decoded name matches the country-code TLD,
e.g. Hangul would be shown for a .kr domain), but in general it's a
dangerous feature because it makes phishing attempts easier.
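To show what that looks like on the wire, Python's built-in IDNA
codec does the nameprep + Punycode step (the hostname is made up,
and this is just a sketch):

  # Each non-ASCII label is Punycode-encoded and given the "xn--"
  # ACE prefix; ASCII labels like "kr" pass through unchanged.
  hostname = "한글.kr"                  # example only, not a real domain
  ascii_form = hostname.encode("idna")  # b'xn--....kr'
  print(ascii_form)
  print(ascii_form.decode("idna"))      # back to the Unicode form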
--
Alex