-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Brian Keck wrote:
> Hello,
>
> I'm wondering if I've found a bug in the excellent wget.
> I'm not asking for help, because it turned out not to be the reason
> one of my scripts was failing.
>
> The possible bug is in the derivation of the filename from a URL which
> contains UTF-8.
>
> The case is:
>
>   wget http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk
>
> Of course these are all ascii characters, but underlying it are
> 3 nonascii characters, whose UTF-8 encoding is:
>
>   hex   octal    name
>   ----  -------  ---------
>   C387  303 207  C-cedilla
>   C3B6  303 266  o-umlaut
>   C3BC  303 274  u-umlaut
>
> The file created has a name that's almost, but not quite, a valid UTF-8
> bytestring ...
>
>   ls *y*k | od -tc
>   0000000 303 % 8 7 a t a l h 303 266 y 303 274 k \n
>
> Ie the o-umlaut & u-umlaut UTF-8 encodings occur in the bytestring,
> but the UTF-8 encoding of C-cedilla has its 2nd byte replaced by the
> 3-byte string "%87".
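
For what it's worth, the byte values in the table are easy to double-check. The following is just a quick Python sketch (nothing to do with Wget's own code, and the variable names are mine) that undoes the percent-escapes in the URL path and prints each non-ASCII byte in hex and octal:

```python
from urllib.parse import unquote_to_bytes

# Path component taken from the URL in the report.
path = "%C3%87atalh%C3%B6y%C3%BCk"

# Undo the %XX escapes, giving the raw bytes the URL stands for.
raw = unquote_to_bytes(path)

# Collect the bytes outside the ASCII range.
non_ascii = [b for b in raw if b >= 0x80]

# Print each one in hex and octal, in the same spirit as the table.
for b in non_ascii:
    print(f"{b:02X}  {b:03o}")

# Read as UTF-8, the raw bytes give the article title.
title = raw.decode("utf-8")
print(title)
```

Of the six non-ASCII bytes, only 0x87 (octal 207) lies in the range 0x80-0x9f.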

Using --restrict-file-names=nocontrol will do what you want it to, in
this instance.

> I'm guessing this is not intended.

Actually, it is (more or less). Wget has no way to tell whether you're
trying to give it UTF-8 or one of the ISO Latin charsets, and it tends
to assume the latter. It also, by default, will not create filenames
with control characters in them. In the ISO Latin charsets, the bytes
0x80-0x9f are control characters. That is why Wget left %87, which falls
into that range, escaped, while it unescaped the other bytes, which
don't.

It is actually illegal to specify byte values outside the ASCII range
in a URL, but it has long been common practice to do so anyway. In most
cases, the intended meaning was one of the Latin character sets (usually
Latin-1), so Wget's behavior was the right choice at the time.

There is now a standard for representing Unicode characters in URLs;
the results are called IRIs (Internationalized Resource Identifiers).
Conforming correctly to this standard would require that Wget be
sensitive to the context and encoding of the documents in which it
finds URLs; in the case of filenames and command arguments, it would
probably also require sensitivity to the current locale as determined
by environment variables.

Wget is simply not equipped to handle IRIs or encoding issues at the
moment, so until it is, a proper fix will not be in place. Addressing
these issues is considered a "Wget 2.0" (next-generation Wget
functionality) priority, and probably won't be done for a year or two,
given that the number of developers involved with Wget, if you add up
all the part-time helpers (including me), is probably still less than
one full-time dev. :)

- -- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBSHX7M8hyUobTrERCKRLAJwKiDOo0uO7x/k/iAEB/W0pPQmUJQCfUHaP
c6k2490strgy1Efy1DmiOhA=
=7lvZ
-----END PGP SIGNATURE-----
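
P.S. The escaping behavior described in the reply can be modeled in a few lines of Python. This is only a rough sketch under stated assumptions, not Wget's actual code: the names `is_control` and `local_name` are made up for illustration, and the model treats only the C0 range (0x00-0x1f) and the C1 range (0x80-0x9f) as control characters, ignoring every other rule Wget applies to filenames. The `escape_control` flag loosely stands in for the nocontrol setting mentioned in the reply.

```python
from urllib.parse import unquote_to_bytes


def is_control(byte: int) -> bool:
    """True for bytes treated as control characters in this model:
    the C0 range 0x00-0x1f and the ISO Latin C1 range 0x80-0x9f."""
    return byte <= 0x1F or 0x80 <= byte <= 0x9F


def local_name(url_path: str, escape_control: bool = True) -> bytes:
    """Hypothetical model: unescape every %XX in the path, but keep
    control bytes as %XX escapes unless escape_control is False."""
    out = bytearray()
    for byte in unquote_to_bytes(url_path):
        if escape_control and is_control(byte):
            # Leave the byte in its three-character escaped form.
            out += f"%{byte:02X}".encode("ascii")
        else:
            out.append(byte)
    return bytes(out)


path = "%C3%87atalh%C3%B6y%C3%BCk"

# Default: 0x87 stays escaped, so the name is not valid UTF-8.
default_name = local_name(path)
print(default_name)

# With control escaping off, the name is the plain UTF-8 title.
plain_name = local_name(path, escape_control=False)
print(plain_name.decode("utf-8"))
```

With the default setting the model yields the same byte sequence as the od output in the report (303, "%87", "atalh", 303 266, "y", 303 274, "k"); with escaping turned off, the bytes decode cleanly as UTF-8.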