On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <[email protected]> wrote:
> As far as I know, revised RFC permits UTF-8 characters in the URL without
> encoding. Am I wrong here?
>
The latest URI RFC is 3986. The relevant description in prose is:
Local names, such as file system names, are stored with a local
character encoding. URI producing applications (e.g., origin
servers) will typically use the local encoding as the basis for
producing meaningful names. The URI producer will transform the
local encoding to one that is suitable for a public interface and
then transform the public interface encoding into the restricted set
of URI characters (reserved, unreserved, and percent-encodings).
Those characters are, in turn, encoded as octets to be used as a
reference within a data format (e.g., a document charset), and such
data formats are often subsequently encoded for transmission over
Internet protocols.
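Concretely, that transformation looks something like this (a Python
sketch rather than CHICKEN code, and assuming the local name is
stored as UTF-8):

  from urllib.parse import quote, unquote

  # A "local name" containing non-ASCII characters (example only).
  local_name = "한글.html"

  # Transform into the restricted set of URI characters: the UTF-8
  # octets of each non-ASCII character become percent-encodings.
  segment = quote(local_name, safe="")
  print(segment)                     # %ED%95%9C%EA%B8%80.html

  # The consumer reverses this; the URI itself doesn't record which
  # charset the octets are in, so UTF-8 is assumed here.
  assert unquote(segment) == local_name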
The relevant parts of the BNF are:
pct-encoded = "%" HEXDIG HEXDIG
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Thus you can't use raw non-ASCII bytes in a URI: they must be
percent-encoded, and how the resulting octets are interpreted is up
to the origin (overwhelmingly as UTF-8 these days).
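To see how that follows from the grammar, here's a rough pchar check
I put together (a Python sketch, not anything from the RFC): the raw
Hangul string fails it, while its percent-encoded UTF-8 form passes.

  import re

  # pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
  UNRESERVED = r"A-Za-z0-9\-._~"
  SUB_DELIMS = r"!$&'()*+,;="
  PCHAR = re.compile(
      rf"^(?:[{UNRESERVED}{SUB_DELIMS}:@]|%[0-9A-Fa-f]{{2}})*$")

  print(bool(PCHAR.match("%ED%95%9C%EA%B8%80")))  # True
  print(bool(PCHAR.match("한글")))                 # False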
> Even Solr (the search engine) permits them.
>
It would of course be possible for any tool or webserver to
accept URIs with non-ASCII bytes, but I don't know of any
browsers which would _send_ such a request, because in
general it would be rejected.
I tried searching for non-ASCII text on whitehouse.gov (which uses
Solr), and indeed it generated a percent-encoded query. My browser
(Chrome) rendered the percent escapes as UTF-8 for me, though.
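The query string the browser built is just ordinary form encoding of
the UTF-8 octets; in Python, for example (the Korean term here is a
made-up search, not the one I actually typed):

  from urllib.parse import urlencode

  # The browser sends the UTF-8 octets of the query percent-encoded,
  # never the raw bytes.
  print(urlencode({"q": "백악관"}))   # q=%EB%B0%B1%EC%95%85%EA%B4%80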
There's also Punycode, which can be used to represent Unicode
domain names (which otherwise don't even allow percent escapes).
In some cases certain browsers will render this for you (generally
when the script of the decoded name matches the country-code TLD,
e.g. Hangul would be shown for a .kr domain), but in general it's a
dangerous feature because it makes phishing attempts easier.
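To show what that looks like on the wire, Python's built-in IDNA
codec does the nameprep + Punycode step (the hostname is made up,
and this is just a sketch):

  # Each non-ASCII label is Punycode-encoded and given the "xn--"
  # ACE prefix; ASCII labels like "kr" pass through unchanged.
  hostname = "한글.kr"                  # example only, not a real domain
  ascii_form = hostname.encode("idna")  # b'xn--....kr'
  print(ascii_form)
  print(ascii_form.decode("idna"))      # back to the Unicode form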
--
Alex