2012/3/11 Jeremy Shaw <jer...@n-heptane.com>:
> Also, URIs are not defined in terms of octets.. but in terms
> of characters. If you write a URI down on a piece of paper --
> what octets are you using? None.. it's some scribbles on a
> paper. It is the characters that are important, not the bit
> representation.

Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

The syntax of URIs is a mechanism for describing data octets, not
Unicode code points. Describing URIs in terms of Unicode code points
is at variance with the RFC.

> If you render a URI in a utf-8 encoded document versus a
> utf-16 encoded document.. the octets will be different, but
> the meaning will be the same. Because it is the characters
> that are important. For a URI Text would be a more compact
> representation than String.. but ByteString is a bit dodgy
> since it is not well defined what those bytes represent.
> (though if you use a newtype wrapper around ByteString to
> declare that it is Ascii, then that would be fine).

This is all well and good for what a URI is parsed from and what it
is serialized to; but once parsed, the major components of a URI are
all octets, pure and simple. Take the "host" part of the authority:

  host        = IP-literal / IPv4address / reg-name
  ...
  reg-name    = *( unreserved / pct-encoded / sub-delims )

The reg-name production is enough to show that, once the host portion
is parsed, it could contain any bytes whatever, since a pct-encoded
triplet can denote any octet. ByteString is the only correct
representation for a parsed host and userinfo, as well as a parsed
path, query or fragment.

--
Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
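
P.S. To make the reg-name point concrete, here is a minimal sketch of
percent-decoding a component that has already been split out of a URI.
The name pctDecode is hypothetical (it is not taken from any URI
library), and the sketch assumes the input is ASCII text matching the
grammar; it rejects malformed escapes and raw non-ASCII characters
instead of trying to recover.

```haskell
import qualified Data.ByteString as B
import Data.Char (digitToInt, isHexDigit)
import Data.Word (Word8)

-- | Decode the percent-escapes in one URI component.
-- Hypothetical helper for illustration only.
-- Returns Nothing on a malformed escape ("%zz", a trailing "%")
-- or on a raw non-ASCII character.
pctDecode :: String -> Maybe B.ByteString
pctDecode = fmap B.pack . go
  where
    go :: String -> Maybe [Word8]
    go [] = Just []
    -- "%" HEXDIG HEXDIG denotes exactly one octet.
    go ('%' : h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    -- Any other use of '%' is malformed.
    go ('%' : _) = Nothing
    -- Unescaped ASCII characters stand for their own octet.
    go (c : rest)
      | c < '\x80' = (fromIntegral (fromEnum c) :) <$> go rest
      | otherwise  = Nothing

main :: IO ()
main = do
  -- "%FF%FE.example" matches the reg-name grammar, yet the octets
  -- 0xFF 0xFE are not valid UTF-8, so the decoded host cannot be
  -- faithfully held as Text without inventing an encoding for it.
  print (B.unpack <$> pctDecode "%FF%FE.example")
  print (pctDecode "caf%C3%A9")
```

The result type is ByteString on purpose: the decoder can say what the
octets are, but nothing in the grammar says which character encoding,
if any, they are in.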