Juho Vähä-Herttua <[EMAIL PROTECTED]> writes:

> On 22.3.2006, at 17:10, Hrvoje Niksic wrote:
>> Can you elaborate on this?  What I had in mind was:
>>
>> 1. start with a stream of UTF-16 sequences
>> 2. convert that into a string of UCS code points
>> 3. encode that into UTF-8
>> 4. now work with UTF-8 consistently
>>
>> What do you mean by file names as "escaped UTF-16"?
>
> I will take that back a little; after trying it in real life it's
> actually not a very bad idea.
Thanks.  The nice thing about such a transformation is that the
program doesn't have to be aware of the properties of individual
Unicode characters; it only needs to mechanically decode UTF-16 and
encode UTF-8, which shouldn't be hard given that both encodings are
fairly simple and well-defined.

> What I meant was that URL uses 8-bit escaping, but if UTF-16 strings
> have non-ASCII characters how are they encoded?

In Wget's scenario they would be converted to a sequence of UTF-8
bytes representing those non-ASCII characters, and subsequently each
byte would be escaped as %HH, as you say Opera and Firefox do.  In
fact, everything would work exactly the same way as if the HTML
contained the link encoded as UTF-8.  Since most web servers accept
UTF-8 in URLs (or so I'm told), there's a good chance that things
would "just work".

>> So Wget has not only to call libidn, but also to call an unspecified
>> library that converts charsets encountered in HTML (potentially a
>> large set) to Unicode?
>
> Libidn links to iconv (which is a prerequisite for any
> internationalization)

Please note that Wget is portable to systems without iconv, or with
old and very limited versions of iconv.  While I agree that using
iconv is beneficial for many purposes, I also believe that Wget
should try to remain as flexible as possible, which includes doing
the right thing even without external libraries like iconv.  The
above-referenced UTF-16 -> UTF-8 transformation is an example of
that.

For proper construction of IDN hostnames, something like iconv (and
possibly also libidn) appears to be necessary.  Maybe that was part
of the reason why I wasn't too eager to consider IDN in Wget, at
least until it becomes popular enough for supporting it to be a
necessity.

> To answer earlier comments, I never remember saying my patch is
> complete or full and proper IDN support.

You're right, you didn't say that.
I understood that from the statement that it's "very easy to do with
GNU libidn", whereas I now see that you meant that GNU libidn makes
construction of IDN easy, nothing else.

> And my question about DNS queries can be expressed with the
> following patch.  Why not do:
>
> --- clip ---
> Index: src/url.c
> ===================================================================
> --- src/url.c	(revision 2135)
> +++ src/url.c	(working copy)
> @@ -836,8 +836,8 @@
>  	 converted to %HH by reencode_escapes).  */
>        if (strchr (u->host, '%'))
>  	{
> -	  url_unescape (u->host);
> -	  host_modified = true;
> +	  error_code = PE_INVALID_HOST_NAME;
> +	  goto error;
>  	}
>        if (params_b)
> --- clip ---

Because that would break, for example, this URL:

    http://%77%77%77.%63%6e%6e.%63%6f%6d/

which Opera and Konqueror handle (but Firefox doesn't).  I kind of
like the idea that % escapes work anywhere in the URL.  I'm not sure
whether the current RFCs consider the above an acceptable alternative
to "http://www.cnn.com/".

We might want to search for invalid characters after unescaping the
host name, but I saw no reason to do that.  For one, getaddrinfo is
(I suppose) perfectly capable of rejecting invalid host names, or
simply of not finding them.  Also, if such host names happen to
(somehow) work in some environments (for example by libc's
getaddrinfo performing the IDN translation automatically), why
prevent that?

I hope this answers your question.