Hi,
I only just noticed this discussion. Essentially, I think you have arrived at
the right conclusion regarding URIs.
For more background, the IRI document makes interesting reading in this context:
http://tools.ietf.org/html/rfc3987; esp. sections 2, 2.1.
The IRI is defined in terms of Unicode characters, which themselves may be
described/referenced in terms of their code points, but the character encoding
is not prescribed.
In practice, I think systems are increasingly using UTF-8 for transmitting IRIs
and URIs, and using either UTF-8 or UTF-16 for internal storage. There is still
a legacy of ISO-8859-1 being defined as the default charset for HTML (cf.
http://www.w3.org/International/O-HTTP-charset for further discussion).
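To make the distinction concrete, here is a minimal sketch in Haskell using
only base. The function names are my own inventions for illustration, not
taken from any library. It shows that one and the same sequence of characters
yields different octets under UTF-8 and UTF-16, and that percent-encoding only
becomes well defined once an encoding (here UTF-8, by assumption) is chosen.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)
import Data.Word (Word8)

-- | Encode one Unicode code point as UTF-8 octets.
-- (Surrogate code points are not rejected in this sketch.)
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n :: Int
    n = ord c
    -- Leading bits of the code point, placed in the first octet.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation octet: 10xxxxxx carrying six bits of the code point.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode one Unicode code point as UTF-16 (big-endian) octets.
utf16beChar :: Char -> [Word8]
utf16beChar c
  | n < 0x10000 = unit n
  | otherwise   = unit (0xD800 + (m `shiftR` 10)) ++ unit (0xDC00 + (m .&. 0x3FF))
  where
    n, m :: Int
    n = ord c
    m = n - 0x10000
    -- Split one 16-bit code unit into two octets, high octet first.
    unit :: Int -> [Word8]
    unit u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]

-- | The "unreserved" characters of RFC 3986, section 2.3.
isUnreserved :: Char -> Bool
isUnreserved c = isAscii c && (isAlphaNum c || c `elem` "-._~")

-- | Percent-encode a single path segment, assuming UTF-8 as the encoding.
-- Everything outside the unreserved set is escaped, so this is meant for
-- one segment and not for a whole URI with its delimiters.
percentEncodeSegment :: String -> String
percentEncodeSegment = concatMap enc
  where
    enc :: Char -> String
    enc c
      | isUnreserved c = [c]
      | otherwise      = concatMap pct (utf8Char c)
    pct :: Word8 -> String
    pct w = ['%', hex (w `shiftR` 4), hex (w .&. 0x0F)]
    hex :: Word8 -> Char
    hex d = toUpper (intToDigit (fromIntegral d))
```

In GHCi, utf8Char '\233' (the character U+00E9) gives [195,169], while
utf16beChar '\233' gives [0,233]; percentEncodeSegment "r\233sum\233" gives
"r%C3%A9sum%C3%A9". The characters are the same in each case; only the octets
chosen to carry them differ.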
#g
--
On 14/03/2012 06:43, Jason Dusek wrote:
2012/3/12 Jeremy Shaw<jer...@n-heptane.com>:
On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek<jason.du...@gmail.com> wrote:
Well, to quote one example from RFC 3986:
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component.
Right. This describes how to convert an octet into a sequence of characters,
since the only thing that can appear in a URI is sequences of characters.
The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. It is at variance to describe URIs in
terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely
by the identity of the characters that are used. There is definitely no
single, correct byte sequence for representing a URI. If I give you a
sequence of bytes and tell you it is a URI, the only way to decode it is to
first know what encoding the byte sequence uses: ASCII, UTF-16, etc.
Once you have decoded the byte sequence into a sequence of characters, only
then can you parse the URI.
Mr. Shaw,
Thanks for taking the time to explain all this. It's really
helped me to understand a lot of parts of the URI spec a lot
better. I have deprecated my module in the latest release
http://hackage.haskell.org/package/URLb-0.0.1
because a URL parser working on bytes instead of characters
stands out to me now as a confused idea.
--
Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe