On 16/02/13 02:50, L Walsh wrote:
> Ángel González wrote:
>> On 07/02/13 15:06, bes wrote:
>>> Hi,
>>>
>>> I found a bug in wget when interpreting and saving a percent-encoded
>>> 3-byte UTF-8 URL.
>>>
>>> Example:
>>> 1. Create a URL containing "—". This is U+2014 (EM DASH); its
>>>    percent-encoded UTF-8 form is "%E2%80%94".
>>> 2. Try to wget it: wget "http://example.com/abc—d" or, directly,
>>>    wget "http://example.com/abc%E2%80%94d"
>>> 3. Wget saves this URL to the file "abc\342%80%94d". Expected is
>>>    "abc%E2%80%94d". This is a bug.
>>
>> The problem is that it checks whether it's a printable character in
>> latin1.
> Do you mean a printable character in the current locale?

No, I mean in latin1 (ISO-8859-1). If it finds a 'character' like bell
(0x07), wget doesn't try to put that in the filename but leaves it as
%07. The code points 7F-9F are defined in ISO-8859 as control codes
(the C1 set: Start of Selected Area, Partial Line Forward...), so wget
treats them the same way.

However, this implicitly assumes that the URL is in the ISO-8859
family, which used to be common, but URLs in UTF-8 are quite usual
nowadays. Some bytes of a UTF-8 sequence fall in the 7F-9F range, so
wget leaves those as %xy while emitting the other bytes as-is, breaking
the UTF-8 encoding even on systems whose filenames are in UTF-8.
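To make the mixed output concrete, here is a minimal Python sketch of
the unescaping rule described above (the function name and the exact
"printable in latin1" test are my approximation of the behaviour, not
wget's actual code):

```python
import re

def unescape_like_wget(url_path: str) -> bytes:
    """Decode a percent-escape only if the byte is 'printable' in
    latin1, i.e. not an ASCII control (0x00-0x1F, 0x7F) and not a C1
    control (0x80-0x9F); otherwise keep the %xy escape verbatim."""
    out = bytearray()
    i = 0
    while i < len(url_path):
        m = re.match(r'%([0-9A-Fa-f]{2})', url_path[i:])
        if m:
            byte = int(m.group(1), 16)
            if byte >= 0x20 and not (0x7F <= byte <= 0x9F):
                out.append(byte)  # printable in latin1: emit raw byte
            else:
                # control code: keep the escape as-is
                out += url_path[i:i + 3].encode('ascii')
            i += 3
        else:
            out += url_path[i].encode('latin1')
            i += 1
    return bytes(out)

# EM DASH is %E2%80%94: 0xE2 is printable in latin1 (â) and gets
# decoded, but 0x80 and 0x94 are C1 controls and stay escaped.
print(unescape_like_wget('abc%E2%80%94d'))  # b'abc\xe2%80%94d'
```

The raw 0xE2 byte is what the shell later displays as the octal escape
\342, giving the broken filename "abc\342%80%94d" from the bug report.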
> Or can it not do UTF-8 at all?
>
> latin1 is going the way of the dodo... most sites still use it, but
> HTML5 is supposed to be UTF-8.

http://www.whatwg.org/specs/web-apps/current-work/#urls refers to
http://url.spec.whatwg.org/ and it does set the encoding by default to
utf-8. But I think that covers /encoding/ a character, not figuring
out which encoding was used in a URL. We could assume it's the same
charset as the document, but what do we do with documents that have no
charset (through misconfiguration, or because they are scripts,
images...)? It seems easier to treat a URL as utf-8 if it contains
valid utf-8 sequences. That still needs a transformation of filenames,
though.

> If it found "González" in a file, would it be able to save it
> correctly?

wget is always able to download the URLs; the only difference is
whether they "look nice" on your system. A URL like
http://example.org/González in utf-8 would be encoded as
http://example.org/Gonz%c3%a1lez, so wget would think those are the
characters Ã (0xC3) and ¡ (0xA1), saving them "as is". So if my
filenames are utf-8 (eg. Linux) I will see it as González; if they are
latin1 (eg. Windows, using windows-1252) I will see it as GonzÃ¡lez.
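The "treat as utf-8 if it contains utf-8 sequences" heuristic and the
González example can be sketched in Python (looks_like_utf8 is a
hypothetical helper illustrating the idea, not anything in wget):

```python
from urllib.parse import unquote_to_bytes

def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic from the discussion: treat the percent-decoded bytes
    as utf-8 if they form valid utf-8 AND actually use a multi-byte
    sequence (pure ASCII is valid utf-8 but tells us nothing)."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in raw)

raw = unquote_to_bytes('Gonz%c3%a1lez')
print(looks_like_utf8(raw))       # True: valid multi-byte utf-8
print(raw.decode('utf-8'))        # González (utf-8 filenames, Linux)
print(raw.decode('latin1'))       # GonzÃ¡lez (latin1/windows-1252)
```

The same byte string 0xC3 0xA1 yields á on a utf-8 system and the
two characters Ã¡ on a latin1 one, which is exactly the "look nice"
difference described above.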
