> Date: Tue, 18 Aug 2015 17:28:34 +0200
> From: "Andries E. Brouwer" <[email protected]>
> Cc: "Andries E. Brouwer" <[email protected]>, [email protected],
>     [email protected]
>
> > > About the remote situation even less is known.
> >
> > Assuming UTF-8 will go a long way towards resolving this. When this
> > is not so, we have the --remote-encoding switch.
>
> This is wget. The user is recursively downloading a file hierarchy.
> Only after downloading does it become clear what one has got.
In some use cases, yes. In most others, no: the encoding is known in
advance.

> I download a collection of East Asian texts on some topic.
> Upon examination, part is in SJIS, part in Big5, part in EUC-JP,
> part in UTF-8. Since the downloaded stuff does not have a uniform
> character set, and surely the server is not going to specify
> character sets, any invocation of iconv will corrupt my data.
> When I get the unmodified data I look using browser or editor
> or xterm+luit for which character set setting I get readable text.

I already said that wget should support this use case. I just don't
think it should be the default.

> > > It would be terrible if wget decided to use obscure heuristics to
> > > invent a remote character set and then invoke iconv.
> >
> > But what you suggest instead -- create a file name whose bytes are an
> > exact copy of the remote -- is just another heuristic.
>
> No. An exact copy allows me to decide what I have.

Which is exactly the heuristic by which you want this solved. IMO, such
a heuristic will not serve most users in most use cases. Users just
want wget to DTRT automatically, and to have the file names legible.

> Conversion leads to data loss.

When it does, or when there is a risk that it does, users should use
the optional features to countermand that.
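The data-loss point is easy to demonstrate outside wget. The following is a minimal Python sketch, not wget code; the file name "日本語" in Shift_JIS is a hypothetical example of what a server might send:

```python
# Bytes of a file name as a server might send them, here Shift_JIS.
raw = "日本語".encode("shift_jis")

# If the client guesses UTF-8 and transcodes, the Shift_JIS bytes are
# invalid UTF-8 and get turned into replacement characters: the
# original bytes can no longer be recovered from the result.
guessed = raw.decode("utf-8", errors="replace")
assert "\ufffd" in guessed                      # information destroyed
assert guessed.encode("utf-8") != raw           # irreversible

# Keeping the exact bytes, by contrast, lets the user try encodings
# afterwards (as with a browser or xterm+luit) and still succeed.
assert raw.decode("shift_jis") == "日本語"
```

This is the trade-off in the thread: transcoding up front risks irreversible corruption when the guess is wrong, while preserving raw bytes defers the decision to the user.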
