On Monday 17 August 2015 22:51:12 Andries E. Brouwer wrote:
> On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:
>
> > what do we want to achieve here, and why is what wget did
> > before your patch the wrong thing?
>
> Wget modified filenames, and users are unhappy.
> See
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
> http://savannah.gnu.org/bugs/?37564
> http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
> http://stackoverflow.com/questions/27054765/wget-japanese-characters
> http://www.win.tue.nl/~aeb/linux/misc/wget.html
> etc.
>
> It is debatable what precisely would be the right thing,
> but my patch greatly increases the number of happy users.
> Further improvement is possible.
> For example, nothing was changed yet for Windows, but also
> Windows users complain about this wget escaping.

I agree with Eli that we should use iconv. We know the remote encoding and the local encoding, so I don't see a problem here. There are a few cases (when using --input-file) where we have to tell wget the encoding via --remote-encoding.

On Windows the default locale is very often Windows-1252 (aka CP1252), which is a superset of ISO-8859-1, while web servers more and more often deliver content encoded as UTF-8. A UTF-8 filename 'ö.html' (\xC3\xB6.html) should be saved as CP1252 'ö.html' (\xF6.html).

If conversion is not possible because the name contains characters that CP1252 lacks, we should fall back to escaping (as an improvement, we could first try to convert codepoint by codepoint and escape only the ones that are not convertible).

This is already done in the 'wget2' branch, where it can be tested (using src2/wget2). We just have to backport it to the Wget 'master' branch. For me, this is just a matter of available time.

Tim
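
P.S. To make the codepoint-by-codepoint idea concrete, here is a rough sketch in Python. It is not the wget2 code (which is C and uses iconv). The function names and the %XX escape format are assumptions chosen for illustration only.

```python
def percent_escape(data: bytes) -> bytes:
    """Escape every non-ASCII byte as %XX; ASCII bytes pass through unchanged."""
    return "".join(
        chr(b) if b < 0x80 else "%%%02X" % b for b in data
    ).encode("ascii")


def to_local_filename(raw: bytes,
                      remote_encoding: str = "utf-8",
                      local_encoding: str = "cp1252") -> bytes:
    """Convert a remote filename to the local encoding.

    Hypothetical helper: characters that the local encoding cannot
    represent are escaped individually instead of escaping the whole name.
    """
    try:
        text = raw.decode(remote_encoding)
    except UnicodeDecodeError:
        # The name is not valid in the announced remote encoding,
        # so no conversion is possible: escape the raw bytes.
        return percent_escape(raw)

    out = bytearray()
    for ch in text:
        try:
            # Convertible codepoint: store it in the local encoding.
            out += ch.encode(local_encoding)
        except UnicodeEncodeError:
            # Not representable locally: escape this codepoint's
            # original remote-encoding bytes only.
            out += percent_escape(ch.encode(remote_encoding))
    return bytes(out)
```

With the defaults above, to_local_filename(b"\xc3\xb6.html") gives b"\xf6.html", while a name containing Japanese characters keeps its ".html" suffix and gets only the unconvertible characters escaped.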
