Juho Vähä-Herttua <[EMAIL PROTECTED]> writes:

> On 22.3.2006, at 17:10, Hrvoje Niksic wrote:
>> Can you elaborate on this?  What I had in mind was:
>>
>> 1. start with a stream of UTF-16 sequences
>> 2. convert that into a string of UCS code points
>> 3. encode that into UTF-8
>> 4. now work with UTF-8 consistently
>>
>> What do you mean by file names as "escaped UTF-16"?
>
> I will take that back a little; after trying it in real life it's
> actually not a very bad idea.
Thanks.  The nice thing about such a transformation is that the
program doesn't have to be aware of the properties of individual
Unicode characters; it only needs to mechanically decode UTF-16 and
encode UTF-8, which shouldn't be hard given that both encodings are
fairly simple and well-defined.

> What I meant was that URL uses 8-bit escaping, but if UTF-16 strings
> have non-ASCII characters how are they encoded?

In Wget's scenario they would be converted to a sequence of UTF-8
bytes representing those non-ASCII characters, and subsequently each
byte would be escaped as %HH, as you say Opera and Firefox do.  In
fact, everything would work exactly the same way as if the HTML
contained the link encoded as UTF-8.  Since most web servers accept
UTF-8 in URLs (or so I'm told), there's a good chance that things
would "just work".

>> So Wget has not only to call libidn, but also to call an unspecified
>> library that converts charsets encountered in HTML (potentially a
>> large set) to Unicode?
>
> Libidn links to iconv (which is a prerequisite for any
> internationalization)

Please note that Wget is portable to systems without iconv, or with
old and very limited versions of iconv.  While I agree that using
iconv is beneficial for many purposes, I also believe that Wget
should try to remain as flexible as possible, which includes doing
the right thing even without external libraries like iconv.  The
above-referenced UTF-16 -> UTF-8 transformation is an example of
that.

For proper construction of IDN hostnames, something like iconv (and
possibly also libidn) appears to be necessary.  Maybe that was part
of the reason why I wasn't too eager to consider IDN in Wget, at
least until it becomes popular enough for supporting it to be a
necessity.

> To answer earlier comments, I never remember saying my patch is
> complete or full and proper IDN support.

You're right, you didn't say that.
I understood that from the statement that it's "very easy to do with
GNU libidn", whereas I now see that you meant that GNU libidn makes
construction of IDN easy, nothing else.

> And my question about DNS queries can be expressed with the
> following patch.  Why not do:
>
> --- clip ---
> Index: src/url.c
> ===================================================================
> --- src/url.c	(revision 2135)
> +++ src/url.c	(working copy)
> @@ -836,8 +836,8 @@
>  	 converted to %HH by reencode_escapes).  */
>        if (strchr (u->host, '%'))
>  	{
> -	  url_unescape (u->host);
> -	  host_modified = true;
> +	  error_code = PE_INVALID_HOST_NAME;
> +	  goto error;
>  	}
>        if (params_b)
> --- clip ---

Because that would break, for example, this URL:

    http://%77%77%77.%63%6e%6e.%63%6f%6d/

which Opera and Konqueror handle (but Firefox doesn't).  I kind of
like the idea that % escapes work anywhere in the URL.  I'm not sure
whether the current RFCs consider the above an acceptable alternative
to "http://www.cnn.com/".

We might want to search for invalid characters after unescaping the
host name, but I saw no reason to do that.  For one, getaddrinfo is
(I suppose) perfectly capable of rejecting invalid host names, or
simply of not finding them.  Also, if such host names happen to
(somehow) work in some environments (for example by libc's
getaddrinfo performing the IDN translation automatically), why
prevent that?

I hope this answers your question.