2012/3/12 Jeremy Shaw <[email protected]>:
> > The syntax of URIs is a mechanism for describing data octets,
> > not Unicode code points. It is at variance to describe URIs in
> > terms of Unicode code points.
>
> Not sure what you mean by this. As the RFC says, a URI is defined
> entirely by the identity of the characters that are used. There is
> definitely no single, correct byte sequence for representing a URI.
> If I give you a sequence of bytes and tell you it is a URI, the only
> way to decode it is to first know what encoding the byte sequence
> represents: ASCII, UTF-16, etc. Once you have decoded the byte
> sequence into a sequence of characters, only then can you parse the
> URI.
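For what it's worth, that point (decode the bytes to characters first,
parse the URI grammar second) can be sketched with two toy decoders.
This is base only; the function names are mine, a real program would
use something like Data.Text.Encoding, and a real UTF-16 decoder must
also handle surrogate pairs and byte-order marks:

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- Toy decoders: the same URI characters can arrive as different byte
-- sequences, which must be decoded before the URI grammar applies.
decodeAscii :: [Word8] -> String
decodeAscii = map (chr . fromIntegral)

decodeUtf16be :: [Word8] -> String
decodeUtf16be (hi : lo : rest) =
  chr (fromIntegral hi * 256 + fromIntegral lo) : decodeUtf16be rest
decodeUtf16be _ = []

main :: IO ()
main = do
  let asciiBytes = [0x66, 0x6F, 0x6F]                   -- "foo" as ASCII
      utf16Bytes = [0x00, 0x66, 0x00, 0x6F, 0x00, 0x6F] -- "foo" as UTF-16BE
  -- Different octets, same characters, hence the same URI.
  print (decodeAscii asciiBytes == decodeUtf16be utf16Bytes)  -- True
```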
Hmm. Well, I have been reading the spec the other way around: first
you parse the URI to get the bytes, then you use encoding information
to interpret the bytes. I think this curious passage from Section 2.5
bears on the question:

   For most systems, an unreserved character appearing within a URI
   component is interpreted as representing the data octet
   corresponding to that character's encoding in US-ASCII. Consumers
   of URIs assume that the letter "X" corresponds to the octet
   "01011000", and even when that assumption is incorrect, there is
   no harm in making it. A system that internally provides
   identifiers in the form of a different character encoding, such
   as EBCDIC, will generally perform character translation of
   textual identifiers to UTF-8 [STD63] (or some other superset of
   the US-ASCII character encoding) at an internal interface,
   thereby providing more meaningful identifiers than those
   resulting from simply percent-encoding the original octets.

I am really not sure how to interpret this. I have been reading '%'
in productions as the octet 0b00100101, and I have written my parser
that way; but that is probably backwards thinking.

> ...let's say we have the path segments ["foo", "bar/baz"] and we
> wish to use them in the path info of a URI. Because / is a special
> character it must be percent-encoded as %2F. So, the path info for
> the URL would be:
>
>     foo/bar%2Fbaz
>
> If we had the path segments ["foo","bar","baz"], however, that
> would be encoded as:
>
>     foo/bar/baz
>
> Now let's look at decoding the path. If we simply decode the
> percent-encoded characters and give the user a ByteString, then
> both URLs will decode to:
>
>     pack "foo/bar/baz"
>
> which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"]
> represent different paths. The percent encoding there is required
> to distinguish between the two unique paths.

I read the section on paths differently: a path is a sequence of
bytes, wherein slash runs are not permitted, among other rules.
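To make your example concrete, here is a minimal sketch of the
segment-aware treatment you describe (base only, plain Strings; the
names decodePath/encodePath are mine, not from any library): split on
'/' before percent-decoding, and encode each segment before joining.

```haskell
import Data.Char (chr, ord, isHexDigit, digitToInt, intToDigit, toUpper)
import Data.List (intercalate)

-- Split on '/' BEFORE percent-decoding, so an encoded %2F inside a
-- segment is never confused with a real separator.
splitSegments :: String -> [String]
splitSegments s = case break (== '/') s of
  (seg, '/' : rest) -> seg : splitSegments rest
  (seg, _)          -> [seg]

percentDecode :: String -> String
percentDecode ('%' : a : b : rest)
  | isHexDigit a && isHexDigit b =
      chr (digitToInt a * 16 + digitToInt b) : percentDecode rest
percentDecode (c : rest) = c : percentDecode rest
percentDecode []         = []

decodePath :: String -> [String]
decodePath = map percentDecode . splitSegments

-- The encoding direction: percent-encode '/' (and '%' itself) within
-- each segment, then join the segments with a literal '/'.
encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c
      | c `elem` "/%" = '%' : map toUpper
          [intToDigit (ord c `div` 16), intToDigit (ord c `mod` 16)]
      | otherwise     = [c]

encodePath :: [String] -> String
encodePath = intercalate "/" . map encodeSegment

main :: IO ()
main = do
  print (decodePath "foo/bar%2Fbaz")     -- ["foo","bar/baz"]
  print (decodePath "foo/bar/baz")       -- ["foo","bar","baz"]
  print (encodePath ["foo", "bar/baz"])  -- "foo/bar%2Fbaz"
```

A real implementation would of course work over Word8/ByteString and
encode the full reserved set, but the ordering of split and decode is
the point here.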
However, re-reading the section, a big to-do is made about
hierarchical data and path normalization; it really seems your
interpretation is the correct one. I tried it out with cURL, for
example:

    http://www.ietf.org/rfc%2Frfc3986.txt   # 404 Not Found
    http://www.ietf.org/rfc/rfc3986.txt     # 200 OK

My recently released URL parser/pretty-printer is actually wrong in
its handling of paths and, when corrected, will only amount to a
parser of URLs that are encoded in US-ASCII and supersets thereof.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
