Re: [Bug-wget] bad filenames (again)

Tim Ruehsen Tue, 18 Aug 2015 04:14:14 -0700

On Monday 17 August 2015 22:51:12 Andries E. Brouwer wrote:
> On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:
> > what do we want to achieve here, and why is what wget did
> > before your patch the wrong thing?
> 
> Wget modified filenames, and users are unhappy.
> See
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
> http://savannah.gnu.org/bugs/?37564
> http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
> http://stackoverflow.com/questions/27054765/wget-japanese-characters
> http://www.win.tue.nl/~aeb/linux/misc/wget.html
> etc.
> 
> It is debatable what precisely would be the right thing,
> but my patch greatly increases the number of happy users.
> Further improvement is possible.
> For example, nothing was changed yet for Windows, but also
> Windows users complain about this wget escaping.


I agree with Eli that we should use iconv.
We know the remote encoding and the local encoding, so I don't see a problem
here. There are a few cases (e.g. when using --input-file) where we have to
tell wget the encoding via --remote-encoding.

On Windows the default locale is very often Windows-1252 (aka CP1252),
which is a superset of the printable part of ISO-8859-1, while web servers
more and more often deliver content encoded as UTF-8. A UTF-8 filename of
'ö.html' (\xC3\xB6.html) should be saved as CP1252 'ö.html' (\xF6.html).
If conversion is not possible because some characters are not included in
CP1252, we should fall back to escaping (as an improvement, we could first
try to convert codepoint by codepoint and escape only the ones that are not
convertible).
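
To illustrate the idea, here is a rough sketch in Python rather than in
wget's C/iconv code. It is not what wget or wget2 actually does: the function
name convert_filename is made up, and the choice to escape an unconvertible
character as %XX over its bytes in the remote encoding is an assumption for
the sake of the example.

```python
def convert_filename(raw: bytes,
                     remote_encoding: str = "utf-8",
                     local_encoding: str = "cp1252") -> bytes:
    """Convert a remote filename to the local encoding, codepoint by
    codepoint, escaping only what cannot be converted (hypothetical helper)."""
    try:
        text = raw.decode(remote_encoding)
    except UnicodeDecodeError:
        # The name is not valid in the declared remote encoding:
        # keep ASCII bytes and %XX-escape every other byte.
        return b"".join(
            bytes([b]) if b < 0x80 else b"%%%02X" % b for b in raw
        )

    out = bytearray()
    for ch in text:
        try:
            # Representable in the local encoding: convert it.
            out += ch.encode(local_encoding)
        except UnicodeEncodeError:
            # Not representable: escape this character's remote bytes
            # (assumed escaping scheme, see the note above).
            for b in ch.encode(remote_encoding):
                out += b"%%%02X" % b
    return bytes(out)


if __name__ == "__main__":
    # 'ö.html' as UTF-8 becomes the single CP1252 byte 0xF6 plus '.html'.
    print(convert_filename(b"\xc3\xb6.html"))
    # A Japanese character has no CP1252 equivalent, so it gets escaped.
    print(convert_filename("日.html".encode("utf-8")))
```

Only the characters missing from the local encoding end up escaped, so a
name that mixes 'ö' with Japanese characters still keeps a readable 'ö'.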

This is already done in the 'wget2' branch, where it can be tested (using
src2/wget2). We just have to backport it to the Wget 'master' branch. For me,
this is just a matter of available time.

Tim
