Dear Perl Unicode Experts,
I have been looking at how much work it would take to get the URI and URI::Escape modules to support IRIs in a reasonable way. The IRI spec has just been published as an IETF Proposed Standard at http://www.ietf.org/rfc/rfc3987.txt. A new version of the URI spec is now Internet Standard 66 and is available at http://www.ietf.org/rfc/rfc3986.txt.
I'm looking for two things:
a) short-term, how to get IRI support using the above and maybe some additional modules
b) long-term, how to make these modules (and maybe others) work with IRIs as well as with the new URI spec
Support for these new specs mainly includes the following things:
1) Escaping with %hh is based on UTF-8, not some local character encoding
2) URIs now allow %hh in the host name part, and require that it is interpreted as UTF-8
3) IDNs (i.e. conversion to punycode, and if possible also nameprep/stringprep) should be supported
4) The user of e.g. the URI module should ideally only have to deal with one form of the URI/IRI, the one used to construct the URI/IRI, although it should be possible to create other forms (e.g. a fully %-encoded URI, an IRI that contains as few %hh as possible)
5) It should be possible to apply normalization operations as described in the IRI spec on different parts of a URI/IRI
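To make points 1) and 4) concrete, here is a minimal sketch of what I mean by UTF-8-based %hh escaping, and of the reverse step that turns non-ASCII escapes back into characters. It assumes a Perl with the Encode module (5.8 or later; 5.6.1 does not ship it). The names utf8_percent_encode and percent_decode_non_ascii are made up for illustration and are not part of URI or URI::Escape. The reverse step is deliberately simplified: a run of escapes is converted only if the whole run is valid UTF-8, and it ignores the characters that the IRI spec says should stay escaped.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: percent-encode every octet outside the RFC 3986
# "unreserved" set, after converting the characters to UTF-8 octets.
sub utf8_percent_encode {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);    # characters -> UTF-8 octets
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

# Hypothetical helper: turn one run of %hh escapes into characters if the
# run decodes as valid UTF-8; otherwise return the run unchanged.
sub _decode_run {
    my ($run) = @_;
    (my $octets = $run) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    my $chars = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
    return defined $chars ? $chars : $run;
}

# Hypothetical helper: unescape only runs of non-ASCII octets (%80..%FF),
# leaving ASCII escapes such as %20 alone.
sub percent_decode_non_ascii {
    my ($uri) = @_;
    $uri =~ s/((?:%[89A-Fa-f][0-9A-Fa-f])+)/_decode_run($1)/ge;
    return $uri;
}

print utf8_percent_encode("\x{FD}"), "\n";     # prints %C3%BD
print utf8_percent_encode("\x{370}"), "\n";    # prints %CD%B0
```

With this sketch, percent_decode_non_ascii('a%CD%B0b') gives the string "a\x{370}b", while '%FD' (not valid UTF-8 on its own) and '%20' come back unchanged.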
I started with some very simple (I thought) tests, but got completely confused very quickly. Here is the short program that I was using:
>>>> test.pl
use utf8;
use URI;
use URI::Escape;

print (uri_escape("\xFD") . "\n");
print (iri_escape("\xFD") . "\n");
print (uri_escape("\x{FD}") . "\n");
print (iri_escape("\x{FD}") . "\n");
print (uri_escape("\x{370}") . "\n");
print (iri_escape("\x{370}") . "\n");

sub iri_escape {
    return substr (uri_escape("\x{370}".shift), 6);
}
>>>>
With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail), I get:
>>>>
%FD
%C3%BD
%C3%BD
%C3%BD
%CD%B0
%CD%B0
>>>>
which seems to show that the trick with adding a non-Latin-1 character and then removing its escaped form works (compare the first line to the second line).
However, on perl, v5.8.4 built for i386-linux-thread-multi, I get:
>>>> %FD
%FD
>>>>
Nothing seems to work anymore, although (or because?) 5.8 has better Unicode support.
Any help appreciated.
Regards, Martin.