Hi,

I only just noticed this discussion. Essentially, I think you have arrived at the right conclusion regarding URIs.

For more background, the IRI document makes interesting reading in this context: http://tools.ietf.org/html/rfc3987; esp. sections 2, 2.1.

The IRI is defined in terms of Unicode characters, which themselves may be described/referenced in terms of their code points, but the character encoding is not prescribed.
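
To make that distinction concrete, here is a minimal Haskell sketch using only base. The function names are hypothetical and chosen for illustration, and the UTF-16 helper assumes characters in the Basic Multilingual Plane (no surrogate pairs). It shows one sequence of characters with a single list of code points but different octet sequences under two encodings.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- The identity of the characters: one code point per character.
codePoints :: String -> [Int]
codePoints = map ord

-- ISO-8859-1 octets; Nothing if a character is not representable.
latin1Octets :: String -> Maybe [Word8]
latin1Octets = traverse enc
  where
    enc c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- UTF-16 (big-endian) octets; assumption: BMP characters only,
-- so anything above U+FFFF is rejected rather than split into surrogates.
utf16beOctets :: String -> Maybe [Word8]
utf16beOctets = fmap concat . traverse enc
  where
    enc c
      | n <= 0xFFFF = Just [fromIntegral (n `div` 256), fromIntegral (n `mod` 256)]
      | otherwise   = Nothing
      where n = ord c
```

Applied to the same string, codePoints gives one answer while latin1Octets and utf16beOctets give two different byte sequences, which is why the octets alone do not identify the IRI.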

In practice, I think systems are increasingly using UTF-8 for transmitting IRIs and URIs, and using either UTF-8 or UTF-16 for internal storage. There is still a legacy of ISO-8859-1 being defined as the default charset for HTML (cf. http://www.w3.org/International/O-HTTP-charset for further discussion).
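
As a rough illustration of UTF-8 on the wire, the sketch below maps an IRI to a URI by converting each non-ASCII character to UTF-8 octets and percent-encoding them, along the lines of RFC 3987 section 3.1. It is a simplification under stated assumptions: it treats every non-ASCII character alike, does no special handling of the host component, and does not validate the input. The names utf8, percentOctet and iriToUri are hypothetical, and the UTF-8 encoder is hand-rolled so that only base is needed.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, ord, toUpper)
import Data.Word (Word8)

-- Encode one character as UTF-8 octets (1 to 4 bytes, by code point range).
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length tag plus the top bits of the code point.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Render one octet as "%XX" with uppercase hex digits.
percentOctet :: Word8 -> String
percentOctet w = ['%', hex (w `shiftR` 4), hex (w .&. 0x0F)]
  where
    hex = toUpper . intToDigit . fromIntegral

-- Simplified IRI-to-URI mapping: ASCII passes through unchanged,
-- everything else becomes percent-encoded UTF-8.
iriToUri :: String -> String
iriToUri = concatMap step
  where
    step c
      | ord c < 0x80 = [c]
      | otherwise    = concatMap percentOctet (utf8 c)
```

For example, iriToUri "http://example.org/r\233sum\233" yields "http://example.org/r%C3%A9sum%C3%A9". The result is again a sequence of characters, all of them ASCII, rather than a sequence of bytes.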

#g
--

On 14/03/2012 06:43, Jason Dusek wrote:
2012/3/12 Jeremy Shaw<jer...@n-heptane.com>:
On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek<jason.du...@gmail.com>  wrote:
Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

Right. This describes how to convert an octet into a sequence of characters,
since the only thing that can appear in a URI is sequences of characters.

The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. It is at variance to describe URIs in
terms of Unicode code points.


Not sure what you mean by this. As the RFC says, a URI is defined entirely
by the identity of the characters that are used. There is definitely no
single, correct byte sequence for representing a URI. If I give you a
sequence of bytes and tell you it is a URI, the only way to decode it is to
first know what encoding the byte sequence represents: ASCII, UTF-16, etc.
Once you have decoded the byte sequence into a sequence of characters, only
then can you parse the URI.

Mr. Shaw,

Thanks for taking the time to explain all this. It's really
helped me to understand a lot of parts of the URI spec a lot
better. I have deprecated my module in the latest release

   http://hackage.haskell.org/package/URLb-0.0.1

because a URL parser working on bytes instead of characters
stands out to me now as a confused idea.

--
Jason Dusek
pgp  ///  solidsnack  1FD4C6C1 FED18A2B

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe



