subject:"Re\: \[Bug\-wget\] bad filenames $again$"

On Fri, Aug 21, 2015 at 12:07:56PM +0200, Tim Ruehsen wrote:

 The charset is *not* determined (guessed) from the URL string, be it hex 
 encoded or not. We take the locale setup as default, but it can be overridden 
 by --local-encoding. Right now, Wget does not have the ability to have 
 different encodings for file input (--input-file) and input via STDIN (when 
 used at the same time). But that is another issue...

It seems to me that I keep saying the same thing. We are not communicating.
You talk about locale and local-encoding but that is not the point.

There is a remote site.
Nothing is known about this remote site.
Certainly there is no reason to assume that it uses a character set
that is related to the local setup of the machine here that runs wget.

Since nothing is known about this remote site, it is impossible
to know the character set (if any) of the filenames. And hence
it is impossible to invoke iconv, since iconv requires a
from-charset and a to-charset.

Also the user does not know yet what character set this remote site
is using. And it might use more than one. So the user cannot in general
give a --from-charset option.

In this situation: what do you do?

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Ruehsen

On Friday 21 August 2015 13:00:34 Andries E. Brouwer wrote:
 On Fri, Aug 21, 2015 at 12:07:56PM +0200, Tim Ruehsen wrote:
  The charset is *not* determined (guessed) from the URL string, be it hex
  encoded or not. We take the locale setup as default, but it can be
  overridden by --local-encoding. Right now, Wget does not have the ability
  to have different encodings for file input (--input-file) and input via
  STDIN (when used at the same time). But that is another issue...
 
 It seems to me that I keep saying the same thing. We are not communicating.
Yes, I am also under this impression :-(

 You talk about locale and local-encoding but that is not the point.
Sorry, exactly that seems to be the point.

 There is a remote site.
 Nothing is known about this remote site.
Wrong. Regarding HTTP(S), we exactly know the encoding of each downloaded HTML 
and CSS document (that's what I call 'remote encoding'). It is only these type 
of (downloaded) files we scan when going recursive.
If the server (or document) states a wrong encoding (e.g. *saying* it has 
Japanese/EUC-JP encoding, but in fact it is iso-8859-1 encoded), we either 
have to use escaping or the user uses a --remote-encoding to override the 
wrong server/document statement.

But leaving these misconfigured servers away as a special case, we are fine.

You might take a look at http://www.w3.org/TR/html4/charset.html#h-5.2.2 which 
describes how servers and clients should work regarding HTML character 
encoding (there should be something for CSS as well out there).

Andries, if you still have the impression that we are not communicating, I 
suggest that you make up a simple example test case to show your problem (and 
excuse me please for being kinda dump/blind). Maybe two small HTML files with 
references to each other to demonstrate your point. (I can put them on my 
server and start wget/wget2 on it to see if it works or not).

Regards, Tim


signature.asc
Description: This is a digitally signed message part.

Re: [Bug-wget] bad filenames (again)

On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:

  There is a remote site.
  Nothing is known about this remote site.

 Wrong. Regarding HTTP(S), we exactly know the encoding
 of each downloaded HTML and CSS document
 (that's what I call 'remote encoding').

You are an optimist. In my experience Firefox rarely gets it right.
Let me find some random site. Say
http://web2go.board19.com/gopro/go_view.php?id=12345

If I go there with Firefox, I get a go board with a lot of mojibake
around it. Firefox took the encoding to be Unicode. Trying out what
I have to say in the Text encoding menu, it turns out to be
Chinese, Traditional.

 Leaving these misconfigured servers away as a special case

But most of the East Asian servers I meet are misconfigured in this way.
They announce text/html with charset utf-8 and come with some random
charset.
So trusting this announced charset should be done cautiously.

And you say misconfigured servers, but often one gets a
Unix or Windows file hierarchy, and several character sets occur.
The server doesnt know. The sysadmin doesnt know. A university
machine will have many users with files in several languages
and character sets.

Moreover, the character set of a filename is in general unrelated
to the character set of the contents of the file. That is most clear
when the file is not a text file. What character set is the filename

http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

in? You recognize ISO 8859-1 or similar. My local machine is on UTF-8.
The HTTP headers say Content-Type: image/jpeg.
How can wget guess?

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Ruehsen

On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
 On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:
   There is a remote site.
   Nothing is known about this remote site.
  
  Wrong. Regarding HTTP(S), we exactly know the encoding
  of each downloaded HTML and CSS document
  (that's what I call 'remote encoding').
 
 You are an optimist. In my experience Firefox rarely gets it right.
 Let me find some random site. Say
 http://web2go.board19.com/gopro/go_view.php?id=12345

I try to be an optimist in all situations, yes :-)

 If I go there with Firefox, I get a go board with a lot of mojibake
 around it. Firefox took the encoding to be Unicode. Trying out what
 I have to say in the Text encoding menu, it turns out to be
 Chinese, Traditional.

The server tell us the document is UTF-8.
The document tell us it is 'UTF-8.
But then, some moron (there are a lot of these dudes doing webpage 'design') 
put non UTF-8 text into the document.
That is like putting plum pudding into a jar labeled 'strawberry jam'. You 
will you do ? Go back and return it ? Or accept it saying 'uh oh, my 
strawberry allergy will bite me, but I am a tough guy'.

*BUT* that is not the point for wget, since wget doesn't mess around with the 
texttual content (no conversion takes place). When used recursive, wget will 
extract URLs from the document. *NOT* from the text but from the HTML 
tags/attributes. And *surprise*, all of the links in the document are UTF-8 / 
ASCII (else not a single browser in the world would expect anything else).
And all that matters are the URLs from the HTML attributes.

 And you say misconfigured servers, but often one gets a
 Unix or Windows file hierarchy, and several character sets occur.
 The server doesnt know. The sysadmin doesnt know. A university
 machine will have many users with files in several languages
 and character sets.

Trust them, They know. If not, their web site will be heavily broken.
But there is nothing to fix for us.

 Moreover, the character set of a filename is in general unrelated
 to the character set of the contents of the file. That is most clear
 when the file is not a text file. What character set is the filename
 
 http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file. Content doesn't matter to wget.

Despite from that, if you want to download the above mentioned web page and 
you have a UTF-8 locale, you have to tell wget via --local-encoding what 
encoding the URL is. But if wget --recursive finds the above URL within a HTML 
attribute, you won't need --local-encoding. By the measures taken from 
http://www.w3.org/TR/html4/charset.html#h-5.2.2, wget will know the correct 
encoding and just will do the right thing (after the currently discussed 
change regarding charsets / file naming). Wget2 already does it.


$ wget --local-encoding=iso-8859-1 
'http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg'
--2015-08-21 16:30:05--  
http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg
Resolving www.win.tue.nl (www.win.tue.nl)... 131.155.0.177
Connecting to www.win.tue.nl (www.win.tue.nl)|131.155.0.177|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-21 16:30:05 ERROR 404: Not Found.

--2015-08-21 16:30:05--  
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
Reusing existing connection to www.win.tue.nl:80.
HTTP request sent, awaiting response... 200 OK
Length: 11690 (11K) [image/jpeg]
Saving to: ‘knäckebröd.jpg’

knäckebröd.jp   
100%[=]
  
11.42K  --.-KB/s   in 0.002s 

2015-08-21 16:30:05 (6.83 MB/s) - ‘knäckebröd.jpg’ saved [11690/11690]


(Old wget having the progress bar bug.)


Tim


signature.asc
Description: This is a digitally signed message part.

Re: [Bug-wget] bad filenames (again)

On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:

Let me find some random site. Say
http://web2go.board19.com/gopro/go_view.php?id=12345

The server tell us the document is UTF-8.
The document tell us it is 'UTF-8.

And it is not. So - this example establishes that remote character set
information, when present, is often unreliable.

Let me add one more example,

http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html

a famous Danish recipe. The headers say Content-Type: text/html
without revealing any character set.

Moreover, the character set of a filename is in general unrelated
to the character set of the contents of the file. That is most clear
when the file is not a text file. What character set is the filename

http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file. Content doesn't matter to wget.

Hmm. I thought the topic of our discussion was filenames and character sets.
Here is a file, and its name is in ISO 8859-1.
When wget saves it. What will the filename be?

If you want to download the above mentioned web page and
you have a UTF-8 locale, you have to tell wget via --local-encoding what
encoding the URL is.

Are you sure you do not mean --remote-encoding?

But whatever you mean, it is an additional option.
If the wget user already knows the character set, she can of course tell wget.

The discussion is about the situation where the user does not know.

So, that is the situation we are discussing: a remote site, the user
does not know what encoding is used (she will find out after downloading),
and the headers have either no information or wrong information.
Now if one invokes iconv it is likely that garbage will be the result.

Andries

Here a Korean example.
http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
The http headers say Content-Type: text/plain; charset=iso-8859-1
(which is incorrect), an internal header says that this is ISO-2022-KR
(which is also incorrect), in fact the content is in EUC-KR.
That is none of wget's business, we want to save this file.
The headers say
Content-Disposition: attachment;
filename=20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf
This encodes a valid utf-8 filename, and that name should be used.
So wget should save this file under the name
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Rühsen

Am Freitag, 21. August 2015, 17:28:09 schrieb Andries E. Brouwer:
On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
Let me find some random site. Say
http://web2go.board19.com/gopro/go_view.php?id=12345

The server tell us the document is UTF-8.
The document tell us it is 'UTF-8.

And it is not. So - this example establishes that remote character set
information, when present, is often unreliable.

Let me add one more example,

http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html

a famous Danish recipe. The headers say Content-Type: text/html
without revealing any character set.

1. There is no URL to parse in this document, so encoding does not matter
anyway.

2. If the server AND the document do not explicitly specify the character
encoding, there still is one - namely the default. Has been ISO-8859-1 a while
ago. AFAIR, HTML5 might have changed that (too late for me now to look it up).

The is a good diagram - maybe not perfectly up-to-date but it still shows
roughly how to operate:
http://nikitathespider.com/articles/EncodingDivination.html

http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file. Content doesn't matter to wget.

Hmm. I thought the topic of our discussion was filenames and character sets.
Here is a file, and its name is in ISO 8859-1.
When wget saves it. What will the filename be?

If you want to download the above mentioned web page and
you have a UTF-8 locale, you have to tell wget via --local-encoding what
encoding the URL is.

Are you sure you do not mean --remote-encoding?

Yes, I am sure. Here my tests (my locale is UTF-8):

Wrong:
$ wget -nv --remote-encoding=iso-8859-1
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
2015-08-21 20:09:30 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
[11690/11690] - kn�ckebr�d.jpg.1 [1]

Right:
http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg:
2015-08-21 20:14:18 FEHLER 404: Not Found.
2015-08-21 20:14:18 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
[11690/11690] - knäckebröd.jpg [1]

But whatever you mean, it is an additional option.
If the wget user already knows the character set, she can of course tell
wget.

The discussion is about the situation where the user does not know.

Here a Korean example.
http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
The http headers say Content-Type: text/plain; charset=iso-8859-1
(which is incorrect), an internal header says that this is ISO-2022-KR
(which is also incorrect), in fact the content is in EUC-KR.
That is none of wget's business, we want to save this file.
The headers say
Content-Disposition: attachment;
filename=20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%
EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%E
B%B0%B1_.sgf This encodes a valid utf-8 filename, and that name should be
used. So wget should save this file under the name
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

This is a different issue. Here we are talking about the encoding of HTTP
headers, especially 'filename' values within Content-Disposition HTTP header.
The above is correctly encoded (UTF-8 percent encoding).

The encoding is described in RFC5987 (Character Set and Language Encoding for
Hypertext Transfer Protocol (HTTP) Header Field Parameters).

Wget simply does not parse this correctly - it is just not coded in.
That is why support for Content-Disposition in Wget is documented as
'experimental' (you have to explicitly enable it via --content-disposition).

Again the server encoding is known. Regarding filename encoding, nothing is
wrong in your example. It is just Wget missing some code here (worth opening a
separate bug).

Default Wget behavior:
$ wget -nv http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
2015-08-21 20:20:05
URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] -
1847B5314CF754B83134B7 [1]

Enabled Content-Disposition support:
$ wget -nv --content-disposition
http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
2015-08-21 20:23:50
URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] -
20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf

[1]

As we see, unescaping and UTF-8 to locale

Re: [Bug-wget] bad filenames (again)

On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:

  Content-Disposition: attachment;
  filename=20101202_%EB...%A8-%EB%B0%B1_.sgf
  This encodes a valid utf-8 filename, and that name should be used.
  So wget should save this file under the name
  20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
 
 This is a different issue. Here we are talking about the encoding of HTTP 
 headers, especially 'filename' values within Content-Disposition HTTP header.
 Wget simply does not parse this correctly - it is just not coded in.
 It is just Wget missing some code here (worth opening a separate bug).

Good, saved for later.

 If the server AND the document do not explicitly specify the character 
 encoding, there still is one - namely the default. Has been ISO-8859-1
 a while ago. AFAIR, HTML5 might have changed that (too late for me now
 to look it up).

Yes - that is our main difference. You read the standard and find there
what everyone is supposed to do, or what the default is.
I download stuff from the net and encounter lots of things people do,
that are perhaps not according to the most recent standard,
and may differ from the default.

As a consequence I prefer to base the decision about what to do
on the form of the filename (ASCII / UTF-8 / other), not on the
headers encountered on the way to this file.

Fortunately, almost all URLs are in ASCII - no problem.
Fortunately, almost all that are not in ASCII, are UTF-8.
The good thing of UTF-8 is that it has a quite typical bit pattern.
A non-ASCII filename that is valid UTF-8 is very likely UTF-8.
So, one can recognize ASCII and UTF-8 rather reliably.

(By the way, I checked my conjecture that iconv from UTF-8
to UTF-8 need not be the identity map, and that is indeed the case.
On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Ruehsen

On Friday 21 August 2015 02:08:43 Andries E. Brouwer wrote:
 On Thu, Aug 20, 2015 at 10:47:35AM +0200, Tim Ruehsen wrote:
  Basically, I keep track of the charset of each URL input
  (command line, input file, stdin, downloaded+scanned).
 
 It seems to me, you can't. Consider for example a command line
 that gives a URL hex escaped. Now the command line is pure ASCII
 and gives no information at all about the character set of the filename.

The charset is *not* determined (guessed) from the URL string, be it hex 
encoded or not. We take the locale setup as default, but it can be overridden 
by --local-encoding. Right now, Wget does not have the ability to have 
different encodings for file input (--input-file) and input via STDIN (when 
used at the same time). But that is another issue...

Tim


signature.asc
Description: This is a digitally signed message part.

Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Andries E. Brouwer

On Wed, Aug 19, 2015 at 05:38:39PM +0300, Eli Zaretskii wrote:

  Assign a character set as follows:
  - if the user specified a from-charset, use that
  - if the name is printable ASCII (in 0x20-0x7f), take ASCII
  - if the name is non-ASCII and valid UTF-8, take UTF-8
  - otherwise take Unknown.
 
 I think this is simpler and produces the same results:
  - if the user specified a from-charset, use that
  - otherwise assume UTF-8

Simpler, but the results are not the same.

If the from-charset is unknown, then any call of iconv will certainly
lead to bad results. So there are only the two possibilities:
(i) leave as-is (if that is the user's preference)
(ii) make pure ASCII via hex escapes.

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Eli Zaretskii

 From: Tim Ruehsen tim.rueh...@gmx.de
 Cc: Andries E. Brouwer andries.brou...@cwi.nl
 Date: Thu, 20 Aug 2015 10:47:35 +0200

  Tim says he has some/most of that coded on a branch, so I think we
  should start by merging that branch, and then take it from there.

 It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 
 'click on the merge button' to merge.
 Basically, I keep track of the charset of each URL input (command line, input 
 file, stdin, downloaded+scanned). So when generating the filename we have the 
 to and from charset. When iconv fails here (e.g. Chinese input, ASCII 
 output), 
 escaping takes place.

Sounds good to me.  Is something holding the merge of this to master?

Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Andries E. Brouwer

On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote:

 OK, but how is this different from what we'd get using your suggested
 4 alternatives?

What can I reply? Just read my letter again.
I think I said what I wanted to say.

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Tim Ruehsen

On Thursday 20 August 2015 17:39:09 Eli Zaretskii wrote:
  From: Tim Ruehsen tim.rueh...@gmx.de
  Cc: Andries E. Brouwer andries.brou...@cwi.nl
  Date: Thu, 20 Aug 2015 10:47:35 +0200

   Tim says he has some/most of that coded on a branch, so I think we
   should start by merging that branch, and then take it from there.

  It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can
  just 'click on the merge button' to merge.
  Basically, I keep track of the charset of each URL input (command line,
  input file, stdin, downloaded+scanned). So when generating the filename
  we have the to and from charset. When iconv fails here (e.g. Chinese
  input, ASCII output), escaping takes place.

 Sounds good to me.  Is something holding the merge of this to master?

Sorry it should have been so you *can't* just 'click on the merge button' to 
merge :-) I have to do some more organizational stuff over there before I 
introduce an official alpha version (but it is working already with a bunch of 
new features).

Tim

signature.asc
Description: This is a digitally signed message part.

Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Andries E. Brouwer

On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

 OK, then let me explain my line of reasoning.  Plain ASCII is valid
 UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
 know it's not valid UTF-8.  So the last 3 possibilities in your
 suggestion boil down to try converting as if it were UTF-8, and if
 that fails, you know it's Unknown.

Yes, although I would not invoke iconv to actually convert from UTF-8 to
UTF-8. Unicode is a complicated beast, and it is not certain that
conversion from UTF-8 to UTF-8 is the identity transformation.
(For example, implementations may prefer either NFC or NFD.
MacOS has its own NFD-like version for filenames.)
But you are right, one can use it as test.

After finding out that the charset is unknown I want to hex-encode
the entire filename. On the other hand, if the appropriate thing
is to invoke iconv to convert from one charset to another, I want
to hex-encode only the failing bytes.

This difference because (a) if there is reason to expect that
conversion should be possible, for example because the user
specified the from-charset as GB18030, and it fails, then often
only in a few isolated places where Microsoft extensions are used,
and it is more user-friendly to do the conversion where possible.
but (b) if nothing is known, then the character set can be a
multibyte one like SJIS where ASCII bytes occur as second halves
of symbols, and not escaping such ASCII bytes is confusing
and sometimes leads to strange problems.

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Tim Ruehsen

On Wednesday 19 August 2015 17:38:39 Eli Zaretskii wrote:
  Date: Wed, 19 Aug 2015 02:52:57 +0200
  From: Andries E. Brouwer andries.brou...@cwi.nl
  Cc: bug-wget@gnu.org

  Look at the remote filename.

  Assign a character set as follows:
  - if the user specified a from-charset, use that
  - if the name is printable ASCII (in 0x20-0x7f), take ASCII
  - if the name is non-ASCII and valid UTF-8, take UTF-8
  - otherwise take Unknown.

 I think this is simpler and produces the same results:
  - if the user specified a from-charset, use that
  - otherwise assume UTF-8

  Determine a local character set as follows:
  - if the user specified a to-charset, use that
  - if the locale uses UTF-8, use that
  - otherwise take ASCII

 I suggest this instead:
  - if the user specified a to-charset, use that
  - otherwise, call nl_langinfo(CODESET) to find out the current
locale's encoding

  Convert the name from from-charset to to-charset:
  - if the user asked for unmodified filenames, do nothing
  - if the name is ASCII, do nothing
  - if the name is UTF-8 and the locale uses UTF-8, do nothing
  - convert from Unknown by hex-escaping the entire name
  - convert to ASCII by hex-escaping the entire name
  - otherwise invoke iconv(); upon failure, escape the illegal bytes

 My suggestion:
  - if the user asked for unmodified filenames, do nothing
  - else invoke 'iconv' to convert from remote to local encoding
  - if 'iconv' fails, convert to ASCII by hex-escaping

 Hex-escaping only the bytes that fail 'iconv' is better than
 hex-escaping all of them, but it's more complex, and I'm not sure it's
 worth the hassle.  But if it can be implemented without undue trouble,
 I'm all for it, as it will make wget more user-friendly in those
 cases.

  Once we know what we want it is trivial to write the code,
  but it may take a while to figure out what we want.
  I think we should start applying the current patch.

 Tim says he has some/most of that coded on a branch, so I think we
 should start by merging that branch, and then take it from there.

It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 
'click on the merge button' to merge.
Basically, I keep track of the charset of each URL input (command line, input 
file, stdin, downloaded+scanned). So when generating the filename we have the 
to and from charset. When iconv fails here (e.g. Chinese input, ASCII output), 
escaping takes place.

Tim

Re: [Bug-wget] bad filenames (again)

 Date: Wed, 19 Aug 2015 02:52:57 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: bug-wget@gnu.org

 Look at the remote filename.

 Assign a character set as follows:
 - if the user specified a from-charset, use that
 - if the name is printable ASCII (in 0x20-0x7f), take ASCII
 - if the name is non-ASCII and valid UTF-8, take UTF-8
 - otherwise take Unknown.

I think this is simpler and produces the same results:
 - if the user specified a from-charset, use that
 - otherwise assume UTF-8

 Determine a local character set as follows:
 - if the user specified a to-charset, use that
 - if the locale uses UTF-8, use that
 - otherwise take ASCII

I suggest this instead:
 - if the user specified a to-charset, use that
 - otherwise, call nl_langinfo(CODESET) to find out the current
   locale's encoding

 Convert the name from from-charset to to-charset:
 - if the user asked for unmodified filenames, do nothing
 - if the name is ASCII, do nothing
 - if the name is UTF-8 and the locale uses UTF-8, do nothing
 - convert from Unknown by hex-escaping the entire name
 - convert to ASCII by hex-escaping the entire name
 - otherwise invoke iconv(); upon failure, escape the illegal bytes

My suggestion:
 - if the user asked for unmodified filenames, do nothing
 - else invoke 'iconv' to convert from remote to local encoding
 - if 'iconv' fails, convert to ASCII by hex-escaping

Hex-escaping only the bytes that fail 'iconv' is better than
hex-escaping all of them, but it's more complex, and I'm not sure it's
worth the hassle.  But if it can be implemented without undue trouble,
I'm all for it, as it will make wget more user-friendly in those
cases.

 Once we know what we want it is trivial to write the code,
 but it may take a while to figure out what we want.
 I think we should start applying the current patch.

Tim says he has some/most of that coded on a branch, so I think we
should start by merging that branch, and then take it from there.

Re: [Bug-wget] bad filenames (again)

 Date: Tue, 18 Aug 2015 22:28:21 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org

  What is needed to have a full Unicode support in wget on Windows is to
  provide replacements for all the file-name related libc functions
  ('fopen', 'open', 'stat', 'access', etc.) which will accept file names
  encoded in UTF-8, convert them internally into UTF-16, and call the
  wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
  '_waccess', etc.) with the converted file name.  Another thing that is
  needed is similar replacements for 'printf', 'puts', 'fprintf',
  etc. when they are used for writing file names to the console --
  because we cannot write UTF-8 sequences to the Windows console.

 Aha. That reminds me of a patch by I think Aleksey Bykov.
 Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html

 There we had a similar discussion, and he wrote mswindows.diff with

 +int 
 +wc_utime (unsigned char *filename, struct _utimbuf *times)
 +{
 +  wchar_t *w_filename;
 +  int buffer_size;
 +
 +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, 
 -1, 
 w_filename, 0);
 +  w_filename = alloca (buffer_size);
 +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
 +  return _wutime (w_filename, times);
 +}

 and similar for stat, open, etc. Something similar is what would be needed on 
 Windows?

Yes, thanks for pointing out those patches.  Any reasons they weren't
accepted back then?

 Is his patch usable?

It needs some minor polishing, but in general it should do the job,
yes.

I admit that I don't understand the need for the url.c patch.  Why do
we need to convert to wchar_t when the locale's codeset is already
UTF-8?  (I could understand that for non-UTF-8 locales, but the patch
explicitly limits the conversion to wchar_t and back to UTF-8 locales,
where the normal string functions should do the job.)  Is this only
for converting to upper/lower-case?

There's still the part with writing UTF-8 encoded file/URL names to
the Windows console; that will have to be added.

Re: [Bug-wget] bad filenames (again)

 Date: Wed, 19 Aug 2015 01:43:51 +0200
 From: Ángel González keis...@gmail.com

 +int
 +wc_utime (unsigned char *filename, struct _utimbuf *times)
 +{
 +  wchar_t *w_filename;
 +  int buffer_size;
 +
 +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, 
 filename, -1, 
 w_filename, 0);
 +  w_filename = alloca (buffer_size);
 +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
 +  return _wutime (w_filename, times);
 +}

 and similar for stat, open, etc. Something similar is what would be 
 needed on 
 Windows?
 Is his patch usable? Maybe I also commented a little in
 http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
 but after that nothing happened, it seems.

 That would probably work, but would need a review. On a quick look, some of 
 the functions have memory leaks (seems he first used malloc, then changed to 
 alloca just some of them).

Indeed.  Actually, there's no need to allocate memory dynamically,
neither will malloc nor with alloca, since Windows file names have
fixed size limitation that is known in advance.  So each conversion
function can use a fixed-sized local wchar_t array.  Doing that will
also avoid the need for 2 calls to MultiByteToWideChar, the first one
to find out how much space to allocate.

 And of course, there's the question of what to do if the filename we are 
 trying to convert to utf-16 is not in fact valid utf-8.

The calls to MultiByteToWideChar should use a flag
(MB_ERR_INVALID_CHARS) in its 2nd argument that makes the function
fail with a distinct error code in that case.  When it fails like
that, the wc_* wrappers should simply call the normal unibyte
functions with the original 'char *' argument.  This makes the
modified code fall back on previous behavior when the source file
names are not in UTF-8.

And regardless, wget should convert to the locale's codeset (on all
platforms).  Once the above patches are accepted, the Windows build
will pretend that its locale's codeset is UTF-8, and that will ensure
the conversions with MultiByteToWideChar will work in most situations.

Re: [Bug-wget] bad filenames (again)

 Date: Wed, 19 Aug 2015 20:50:55 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, keis...@gmail.com,
 bug-wget@gnu.org

 On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote:

  OK, but how is this different from what we'd get using your suggested
  4 alternatives?

 What can I reply? Just read my letter again.
 I think I said what I wanted to say.

OK, then let me explain my line of reasoning.  Plain ASCII is valid
UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
know it's not valid UTF-8.  So the last 3 possibilities in your
suggestion boil down to try converting as if it were UTF-8, and if
that fails, you know it's Unknown.

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 11:58:54AM +0200, Tim Ruehsen wrote:

  Unix filenames are sequences of bytes, they do not have a character set.
 
 The character encoding makes with what symbols these bytes
 (or byte sequences aka multibyte / codepoints) are displayed for you.

Sure. So each time I load a different font, I see different glyphs
for my symbols. The file with single-byte name 0xff will look like
a Dutch ligature ij in some fonts, and quite different in other fonts.

The point is: it is the user's choice to load a font. (Or to set a locale.)
The filenames themselves do not carry additional information
about their character set.
For historical reasons a single directory can have files with names
in several character sets.

All this is about the local situation. One cannot know the character set
of a filename because that concept does not exist in Unix.
About the remote situation even less is known. It would be terrible
if wget decided to use obscure heuristics to invent a remote character set
and then invoke iconv.

Andries

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 10:29:40AM +0200, Tim Ruehsen wrote:

 I am going with Eli that we should use iconv.
 We know the remote encoding and the local encoding

Do we?

How do you guess the remote encoding?
Is there any particular encoding?
Unix filenames are sequences of bytes, they do not have a character set.

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Tim Ruehsen

On Monday 17 August 2015 22:51:12 Andries E. Brouwer wrote:
 On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:
  what do we want to achieve here, and why is what wget did
  before your patch the wrong thing?
 
 Wget modified filenames, and users are unhappy.
 See
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
 http://savannah.gnu.org/bugs/?37564
 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
 http://stackoverflow.com/questions/27054765/wget-japanese-characters
 http://www.win.tue.nl/~aeb/linux/misc/wget.html
 etc.
 
 It is debatable what precisely would be the right thing,
 but my patch greatly increases the number of happy users.
 Further improvement is possible.
 For example, nothing was changed yet for Windows, but also
 Windows users complain about this wget escaping.

I am going with Eli that we should use iconv.
We know the remote encoding and the local encoding, so I don't see a problem 
here. There are a few cases (when using --input-file) where we have to tell 
wget the encoding via --remote-encoding.

On Windows we very often have the default locale Windows-1252 (aka CP1252) 
which is a superset of iso-8859-1. While web servers more and more often 
deliver content encoded as UTF-8. A UTF-8 filename of 'ö.html' (\C3x\B6x.html) 
should be saved as CP1252 ö.html (\F6x.html). If conversion is not possible 
due to characters not included into CP1252, we should fallback to escaping ( 
as improvement we could first try to convert codepoint by codepoint and just 
escape the ones not convertable).

This already done in 'wget2' branch where it can be tested (using src2/wget2). 
We just have to backport it to Wget 'master' branch. For me, this is just a 
matter of available time.

Tim


signature.asc
Description: This is a digitally signed message part.

Re: [Bug-wget] bad filenames (again)

On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:

 what do we want to achieve here, and why is what wget did
 before your patch the wrong thing?

Wget modified filenames, and users are unhappy.
See
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
http://savannah.gnu.org/bugs/?37564
http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
http://stackoverflow.com/questions/27054765/wget-japanese-characters
http://www.win.tue.nl/~aeb/linux/misc/wget.html
etc.

It is debatable what precisely would be the right thing,
but my patch greatly increases the number of happy users.
Further improvement is possible.
For example, nothing was changed yet for Windows, but also
Windows users complain about this wget escaping.

Andries

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 07:39:40PM +0300, Eli Zaretskii wrote:

  No. An exact copy allows me to decide what I have.
 
 Which is the heuristic you want this to be solved.  IMO, such a
 heuristic will not server most of the users in most of use cases.
 Users just want wget to DTRT automatically, and have the file names
 legible.

Let me see whether I understand you correctly.

You want to do the right thing. You think that the right thing
would be to invoke iconv. Since the original character set is
unknown to user and wget, you have to guess. What could one guess?
If the string is ASCII, fine. If the string is valid UTF-8, fine.
If the user has specified the character set, fine.
Otherwise? Leave it as it is?

Andries

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote:

   If we convert the file names using iconv, Windows users will also be
   happier, at least when the remote URL can be encoded in their system
   codepage.
  
  Windows does not differ from Unix - since the remote character set
  is unknown and not necessarily constant, a conversion is impossible.
 
 Windows does differ from Unix, in that arbitrary byte sequences cannot
 be used in file names.

Of course. The code already tries to take care of that.

  See
 
   
 https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
 
 for the gory details.

Thanks for the reference!

  I already indicated the 1-line change that fixes the Windows problems.
 
 It doesn't, unfortunately.

You are too brief. What is wrong with the change that changes
/* insert some test for Windows */
into
return true;
?

That change only changes what wget does with bytes in the 128-159 range,
and reading the gory details I fail to see any problem. Almost the opposite:
Use any character in the current code page for a name, including Unicode 
characters
 and characters in the extended character set (128–255)
At first sight, if there were a problem it would be because of the clause
Any other character that the target file system does not allow.

Thanks to your reference I now feel confident to make that 1-line change
so that also Windows users are happy.

Andries


(There are restrictions involving filenames that wget perhaps does not enforce:
no LPT3, no final space or period, ... It might be useful to teach wget about
such details.)

Re: [Bug-wget] bad filenames (again)

Date: Tue, 18 Aug 2015 19:51:58 +0200
From: Andries E. Brouwer andries.brou...@cwi.nl
Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
bug-wget@gnu.org

On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote:

If we convert the file names using iconv, Windows users will also be
happier, at least when the remote URL can be encoded in their system
codepage.

Windows does not differ from Unix - since the remote character set
is unknown and not necessarily constant, a conversion is impossible.

Windows does differ from Unix, in that arbitrary byte sequences cannot
be used in file names.

Of course. The code already tries to take care of that.

It does that badly.

See

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx

for the gory details.

Thanks for the reference!

You are welcome.

I already indicated the 1-line change that fixes the Windows problems.

It doesn't, unfortunately.

You are too brief. What is wrong with the change that changes
/* insert some test for Windows */
into
return true;
?

It preserves the current behavior, whereby almost every non-ASCII URL
out there gets saved in a file name that is either inaccessible to
localized programs, or shows as illegible mujibake.

That change only changes what wget does with bytes in the 128-159 range,
and reading the gory details I fail to see any problem. Almost the opposite:
Use any character in the current code page for a name, including Unicode
characters
and characters in the extended character set (128–255)

You need to read between the lines, as it's Microsoft speak. First,
not every codepoint between 128 and 255 is valid in every codepage.
Second, Windows stores file names in UTF-16, so it attempts to convert
the byte stream into UTF-16 assuming the byte stream is in the current
codepage (which is incorrect in most cases, as we get UTF-8 instead).
The result is an utmost mess.

Thanks to your reference I now feel confident to make that 1-line change
so that also Windows users are happy.

Do you still think that? Then allow me a small demonstration:

D:\usr\eli\datawget
https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
--2015-08-18 21:23:38--
https://ru.wikipedia.org/wiki/%D7%80%C2%A1%D7%80%C2%B5%D7%81%E2%82%AC%D7%80%C2%B4%D7%81%E2%80%A0%D7%80%C2%B5
Loaded CA certificate 'd:/usr/etc/ssl/ca-bundle.crt'
Resolving ru.wikipedia.org (ru.wikipedia.org)... 91.198.174.192
Connecting to ru.wikipedia.org (ru.wikipedia.org)|91.198.174.192|:443...
connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-18 21:23:39 ERROR 404: Not Found.

--2015-08-18 21:23:39--
https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
Reusing existing connection to ru.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'

╫%80┬í╫%80┬╡╫%81Γ%8 [ = ] 180.32K 923KB/s in 0.2s

2015-08-18 21:23:40 (923 KB/s) - '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
saved [184652]

Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
is a good way to express 'Сердце'? Do you think someone will be able
to read and understand such a file name? How would you go about
converting it back to what it should be?

(There are restrictions involving filenames that wget perhaps does not
enforce:
no LPT3, no final space or period, ... It might be useful to teach wget about
such details.)

Indeed. But that's a different issue, I think.

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 09:15:40PM +0300, Eli Zaretskii wrote:

  Otherwise? Leave it as it is?

 No, encode it as %XX hex escapes, thus making the file name pure
 ASCII.  And have an option to leave it as is, so people who want
 that could have that.

OK, I can live with that.


On Tue, Aug 18, 2015 at 09:32:16PM +0300, Eli Zaretskii wrote:

 Second, Windows stores file names in UTF-16, so it attempts to convert
 the byte stream into UTF-16 assuming the byte stream is in the current
 codepage (which is incorrect in most cases, as we get UTF-8 instead).
 The result is an utmost mess.

Yes, conversion always leads to a problems.
So, I see that you want to use iconv to convert UTF-8 to the current
codepage, so that Windows can convert that to UTF-16 again.
As stated several times already I have zero experience on Windows,
but is it possible to let wget change its current codepage to Unicode
so that the Windows conversion is close to the identity map?
It seems silly to have a double conversion with data loss
if just a format conversion would suffice.

Andries

Re: [Bug-wget] bad filenames (again)

 Date: Tue, 18 Aug 2015 21:32:16 +0300
 From: Eli Zaretskii e...@gnu.org
 Cc: bug-wget@gnu.org

   --2015-08-18 21:23:39--  
 https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
   Reusing existing connection to ru.wikipedia.org:443.
   HTTP request sent, awaiting response... 200 OK
   Length: unspecified [text/html]
   Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'

   ╫%80┬í╫%80┬╡╫%81Γ%8 [ =  ] 180.32K   923KB/s   in 0.2s

   2015-08-18 21:23:40 (923 KB/s) - 
 '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' saved [184652]

 Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
 is a good way to express 'Сердце'?  Do you think someone will be able
 to read and understand such a file name?  How would you go about
 converting it back to what it should be?

And of course the file name that is written is yet a different
mojibake: '׳%80ֲ¡׳%80ֲµ׳%81ג%82¬׳%80ֲ´׳%81ג%80 ׳%80ֲµ' (copied from the
directory listing displayed by UTF-16 capable Emacs).  Note that it
has right-to-left characters in it (probably because my locale is for
the Hebrew language), to make it even less legible due to display-time
reordering per the Unicode UAX#9.

Re: [Bug-wget] bad filenames (again)

 Date: Tue, 18 Aug 2015 21:11:25 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org

 On Tue, Aug 18, 2015 at 09:15:40PM +0300, Eli Zaretskii wrote:

   Otherwise? Leave it as it is?

  No, encode it as %XX hex escapes, thus making the file name pure
  ASCII.  And have an option to leave it as is, so people who want
  that could have that.

 OK, I can live with that.

Great, I'm glad we've found an agreeable compromise.

 So, I see that you want to use iconv to convert UTF-8 to the current
 codepage, so that Windows can convert that to UTF-16 again.

Yes.

 As stated several times already I have zero experience on Windows,
 but is it possible to let wget change its current codepage to Unicode
 so that the Windows conversion is close to the identity map?

No, it's not possible.  Windows does have a UTF-8 codepage, but it
doesn't allow setting that as the system codepage.

What is needed to have a full Unicode support in wget on Windows is to
provide replacements for all the file-name related libc functions
('fopen', 'open', 'stat', 'access', etc.) which will accept file names
encoded in UTF-8, convert them internally into UTF-16, and call the
wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
'_waccess', etc.) with the converted file name.  Another thing that is
needed is similar replacements for 'printf', 'puts', 'fprintf',
etc. when they are used for writing file names to the console --
because we cannot write UTF-8 sequences to the Windows console.  Doing
this is not rocket science (I did something similar for Emacs last
year), but more work than just a call to iconv that's needed on Unix.

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 10:31:31PM +0300, Eli Zaretskii wrote:

  Is it possible to let wget change its current codepage to Unicode
  so that the Windows conversion is close to the identity map?
 
 No, it's not possible.  Windows does have a UTF-8 codepage, but it
 doesn't allow setting that as the system codepage.
 
 What is needed to have a full Unicode support in wget on Windows is to
 provide replacements for all the file-name related libc functions
 ('fopen', 'open', 'stat', 'access', etc.) which will accept file names
 encoded in UTF-8, convert them internally into UTF-16, and call the
 wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
 '_waccess', etc.) with the converted file name.  Another thing that is
 needed is similar replacements for 'printf', 'puts', 'fprintf',
 etc. when they are used for writing file names to the console --
 because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov.
Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html

There we had a similar discussion, and he wrote mswindows.diff with

+int 
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, 
w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Something similar is what would be needed on 
Windows?
Is his patch usable? Maybe I also commented a little in
http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
but after that nothing happened, it seems.

Andries

Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Ángel González


On 18/08/15 22:28, Andries E. Brouwer wrote:

On Tue, Aug 18, 2015 at 10:31:31PM +0300, Eli Zaretskii wrote:

No, it's not possible.  Windows does have a UTF-8 codepage, but it
doesn't allow setting that as the system codepage.

What is needed to have a full Unicode support in wget on Windows is to
provide replacements for all the file-name related libc functions
('fopen', 'open', 'stat', 'access', etc.) which will accept file names
encoded in UTF-8, convert them internally into UTF-16, and call the
wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
'_waccess', etc.) with the converted file name.  Another thing that is
needed is similar replacements for 'printf', 'puts', 'fprintf',
etc. when they are used for writing file names to the console --
because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov.
Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html

There we had a similar discussion, and he wrote mswindows.diff with

+int
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, 
w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Something similar is what would be needed on 
Windows?
Is his patch usable? Maybe I also commented a little in
http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
but after that nothing happened, it seems.

Andries
That would probably work, but would need a review. On a quick look, some 
of the functions have memory leaks (seems he first used malloc, then 
changed to alloca just some of them).


And of course, there's the question of what to do if the filename we are 
trying to convert to utf-16 is not in fact valid utf-8.

Re: [Bug-wget] bad filenames (again)

On Wed, Aug 19, 2015 at 01:43:51AM +0200, Ángel González wrote:

 And of course, there's the question of what to do if the filename we
 are trying to convert to utf-16 is not in fact valid utf-8.

My current understanding:

(i) there is a current patch, that fixes most problems on Unix
and can be applied today

(ii) one also wants to fix Windows problems, and in the process
do something more general for Unix. We can discuss a future
patch that does something like:

Look at the remote filename.

Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7f), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.

Determine a local character set as follows:
- if the user specified a to-charset, use that
- if the locale uses UTF-8, use that
- otherwise take ASCII

Convert the name from from-charset to to-charset:
- if the user asked for unmodified filenames, do nothing
- if the name is ASCII, do nothing
- if the name is UTF-8 and the locale uses UTF-8, do nothing
- convert from Unknown by hex-escaping the entire name
- convert to ASCII by hex-escaping the entire name
- otherwise invoke iconv(); upon failure, escape the illegal bytes

See whether the resulting name can be used. On Unix all strings
(without NUL and '/') are ok. On Windows there are many restrictions.
Further hex escape problematic characters on Windows.

Since conversions to 8-bit character sets will often fail,
it is desirable to convince Windows to use Unicode as current codeset.
Maybe that requires a copy of the common fileio routines.

That is my view of the result of the present conversation.
Probably some refinements will be needed. Moreover, there is
interference with iri stuff that should be looked at.

Once we know what we want it is trivial to write the code,
but it may take a while to figure out what we want.
I think we should start applying the current patch.

Andries

Re: [Bug-wget] bad filenames (again)

 Date: Tue, 18 Aug 2015 12:55:50 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: bug-wget@gnu.org, Andries E. Brouwer andries.brou...@cwi.nl,
 Eli Zaretskii e...@gnu.org

 The point is: it is the user's choice to load a font. (Or to set a locale.)

Most users never change a locale, unless they are trying something
special, precisely because their file names will display as mujibake.
So wget should IMO by default cater to this use case, and allow saving
the bytes verbatim as an option.

 For historical reasons a single directory can have files with names
 in several character sets.

Again, this is a rare situation.  We shouldn't punish the majority on
behalf of such rare use cases.

 All this is about the local situation. One cannot know the character set
 of a filename because that concept does not exist in Unix.

Of course, it exists.  The _filesystem_ doesn't know it, but users do.

 About the remote situation even less is known.

Assuming UTF-8 will go a long way towards resolving this.  When this
is not so, we have the --remote-encoding switch.

 It would be terrible if wget decided to use obscure heuristics to
 invent a remote character set and then invoke iconv.

But what you suggest instead -- create a file name whose bytes are an
exact copy of the remote -- is just another heuristic.  And the
effects are no less terrible, because file names will become
illegible, especially on systems where UTF-8 is not the locale's
codeset.

I'm okay with having an option to do that, but it shouldn't be the
default, IMO.

Re: [Bug-wget] bad filenames (again)

 Date: Tue, 18 Aug 2015 17:28:34 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org

   About the remote situation even less is known.

  Assuming UTF-8 will go a long way towards resolving this.  When this
  is not so, we have the --remote-encoding switch.

 This is wget. The user is recursively downloading a file hierarchy.
 Only after downloading does it become clear what one has got.

In some use cases, yes.  In most others, no: the encoding is known in
advance.

 I download a collection of East Asian texts on some topic.
 Upon examination, part is in SJIS, part in Big5, part in EUC-JP,
 part in UTF-8. Since the downloaded stuff does not have a uniform
 character set, and surely the server is not going to specify
 character sets, any invocation of iconv will corrupt my data.
 When I get the unmodified data I look using browser or editor
 or xterm+luit for which character set setting I get readable text.

I already said that wget should support this use case.  I just don't
think it should be the default.

   It would be terrible if wget decided to use obscure heuristics to
   invent a remote character set and then invoke iconv.

  But what you suggest instead -- create a file name whose bytes are an
  exact copy of the remote -- is just another heuristic.

 No. An exact copy allows me to decide what I have.

Which is the heuristic you want this to be solved.  IMO, such a
heuristic will not server most of the users in most of use cases.
Users just want wget to DTRT automatically, and have the file names
legible.

 Conversion leads to data loss.

When it does, or there's a risk that it does, users should use
optional features to countermand that.

Re: [Bug-wget] bad filenames (again)

 Date: Tue, 18 Aug 2015 17:56:30 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org

   For example, nothing was changed yet for Windows, but also
   Windows users complain about this wget escaping.

  If we convert the file names using iconv, Windows users will also be
  happier, at least when the remote URL can be encoded in their system
  codepage.

 Windows does not differ from Unix - since the remote character set
 is unknown and not necessarily constant, a conversion is impossible.

Windows does differ from Unix, in that arbitrary byte sequences cannot
be used in file names.  See

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx

for the gory details.

 I already indicated the 1-line change that fixes the Windows problems.

It doesn't, unfortunately.

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 05:45:13PM +0300, Eli Zaretskii wrote:

  All this is about the local situation. One cannot know the character set
  of a filename because that concept does not exist in Unix.
 
 Of course, it exists.  The _filesystem_ doesn't know it, but users do.

Usually, yes.

  About the remote situation even less is known.
 
 Assuming UTF-8 will go a long way towards resolving this.  When this
 is not so, we have the --remote-encoding switch.

This is wget. The user is recursively downloading a file hierarchy.
Only after downloading does it become clear what one has got.

I download a collection of East Asian texts on some topic.
Upon examination, part is in SJIS, part in Big5, part in EUC-JP,
part in UTF-8. Since the downloaded stuff does not have a uniform
character set, and surely the server is not going to specify
character sets, any invocation of iconv will corrupt my data.
When I get the unmodified data I look using browser or editor
or xterm+luit for which character set setting I get readable text.

  It would be terrible if wget decided to use obscure heuristics to
  invent a remote character set and then invoke iconv.
 
 But what you suggest instead -- create a file name whose bytes are an
 exact copy of the remote -- is just another heuristic.

No. An exact copy allows me to decide what I have.
Conversion leads to data loss.

Andries

Re: [Bug-wget] bad filenames (again)

On Tue, Aug 18, 2015 at 06:22:41PM +0300, Eli Zaretskii wrote:

  It is debatable what precisely would be the right thing,
  but my patch greatly increases the number of happy users.
 
 AFAIU, it does that only when the target locale is UTF-8.
 By using iconv we can make wget DTRT in more locales.

No, because wget, and the invoker of wget, does not know
the remote character set. And there need not be one.
A Chinese site often has a mixture of material in
Traditional Chinese and Simplified Chinese.
Any conversion would just make the stuff unreadable. 

  For example, nothing was changed yet for Windows, but also
  Windows users complain about this wget escaping.
 
 If we convert the file names using iconv, Windows users will also be
 happier, at least when the remote URL can be encoded in their system
 codepage.

Windows does not differ from Unix - since the remote character set
is unknown and not necessarily constant, a conversion is impossible.
I already indicated the 1-line change that fixes the Windows problems.

Andries

Re: [Bug-wget] bad filenames (again)