-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Brian Keck wrote:
> Hello,
> 
> I'm wondering if I've found a bug in the excellent wget.
> I'm not asking for help, because it turned out not to be the reason
> one of my scripts was failing.
> 
> The possible bug is in the derivation of the filename from a URL which
> contains UTF-8.
> 
> The case is:
> 
>   wget http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk
> 
> Of course these are all ascii characters, but underlying it are
> 3 nonascii characters, whose UTF-8 encoding is:
> 
>   hex    octal     name
>   ----  -------  ---------
>   C387  303 207  C-cedilla
>   C3B6  303 266  o-umlaut
>   C3BC  303 274  u-umlaut
> 
> The file created has a name that's almost, but not quite, a valid UTF-8
> bytestring ... 
> 
>   ls *y*k | od -tc
>   0000000 303   %   8   7   a   t   a   l   h 303 266   y 303 274   k  \n
> 
> Ie the o-umlaut & u-umlaut UTF-8 encodings occur in the bytestring,
> but the UTF-8 encoding of C-cedilla has its 2nd byte replaced by the
> 3-byte string "%87".

Using --restrict-file-names=nocontrol will do what you want in this
instance.

> I'm guessing this is not intended.  

Actually, it is (more-or-less).

Realize that Wget really has no way to tell whether you're trying to
give it UTF-8 or one of the ISO Latin charsets. It tends to assume the
latter. It also, by default, will not create filenames with control
characters in them. In the ISO Latin charsets, bytes in the range
0x80-0x9f are control characters. 0x87 falls into that range, so Wget
left it escaped as %87; 0xc3, 0xb6, and 0xbc do not, so they were
written out unescaped.
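
To make the mechanism concrete, here is a minimal Python sketch of the
behavior described above. It is not Wget's actual code: local_name and
is_control are hypothetical names, and treating bytes below 0x20 as
controls is an assumption beyond the 0x80-0x9f range mentioned above.

```python
from urllib.parse import unquote_to_bytes


def is_control(byte: int) -> bool:
    """Assumed control set: ASCII controls below 0x20 plus 0x80-0x9f."""
    return byte < 0x20 or 0x80 <= byte <= 0x9F


def local_name(url_component: str, escape_controls: bool = True) -> bytes:
    """Percent-decode a URL component, then re-escape control bytes."""
    raw = unquote_to_bytes(url_component)
    if not escape_controls:
        return raw
    out = bytearray()
    for b in raw:
        if is_control(b):
            out += b"%%%02X" % b  # put the byte back as %XX
        else:
            out.append(b)
    return bytes(out)


name = local_name("%C3%87atalh%C3%B6y%C3%BCk")
print(name)  # b'\xc3%87atalh\xc3\xb6y\xc3\xbck'
```

The result matches the od output quoted above: only 0x87 is re-escaped.
Passing escape_controls=False in this sketch is loosely analogous to the
nocontrol setting, and the bytes then come out as valid UTF-8.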

Strictly speaking, it is illegal to put raw byte values outside the
ASCII range in a URL, but it has long been common practice to do so
anyway. In most cases, the intended meaning was one of the Latin
character sets (usually Latin-1), so Wget's behavior was the right
choice at the time it was written.
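
The ambiguity can be seen in a short Python sketch (an illustration
only, not something Wget does): the same percent-decoded bytes form a
valid string under both interpretations, so the bytes alone cannot
settle which charset was meant.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%C3%87atalh%C3%B6y%C3%BCk")

# Read as UTF-8: three two-byte sequences become three characters.
as_utf8 = raw.decode("utf-8")

# Read as Latin-1: every byte is its own character, so decoding never
# fails, and 0x87 shows up as a control character.
as_latin1 = raw.decode("latin-1")

print(len(raw), len(as_utf8), len(as_latin1))  # 13 10 13
print(hex(ord(as_latin1[1])))                  # 0x87
```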

There is now a standard for representing Unicode values in URLs; the
results are called IRIs (Internationalized Resource Identifiers).
Conforming correctly to this standard would require Wget to be
sensitive to the context and encoding of the documents in which it finds
URLs; for filenames and command arguments, it would probably also
require sensitivity to the current locale as determined by environment
variables. Wget is simply not equipped to handle IRIs or encoding issues
at the moment, so until it is, a proper fix will not be in place.
Addressing this is considered a "Wget 2.0" (next-generation Wget
functionality) priority, and probably won't be done for a year or two,
given that the number of developers involved with Wget, if you add up
all the part-time helpers (including me), is probably still less than
one full-time dev. :)
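
As a rough idea of what locale sensitivity for command arguments would
involve, here is a minimal Python sketch. It assumes the argument
arrives as bytes in the locale's charset and that the target form is
percent-encoded UTF-8; arg_to_ascii_path is a hypothetical helper, not
anything in Wget.

```python
from urllib.parse import quote


def arg_to_ascii_path(arg: bytes, locale_encoding: str) -> str:
    """Decode an argument using the locale charset, then percent-encode
    its UTF-8 form so the resulting path is pure ASCII."""
    text = arg.decode(locale_encoding)
    return quote(text, safe="/")  # quote() encodes as UTF-8 by default


# The same name typed in a Latin-1 locale and in a UTF-8 locale.
latin1_arg = "Çatalhöyük".encode("latin-1")
utf8_arg = "Çatalhöyük".encode("utf-8")

print(arg_to_ascii_path(latin1_arg, "latin-1"))  # %C3%87atalh%C3%B6y%C3%BCk
print(arg_to_ascii_path(utf8_arg, "utf-8"))      # %C3%87atalh%C3%B6y%C3%BCk
```

Both inputs map to the same ASCII path, but only because the charset is
passed in explicitly; guessing it from the bytes is the part that cannot
be done reliably.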

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBSHX7M8hyUobTrERCKRLAJwKiDOo0uO7x/k/iAEB/W0pPQmUJQCfUHaP
c6k2490strgy1Efy1DmiOhA=
=7lvZ
-----END PGP SIGNATURE-----
