Re: [Bug-wget] bad filenames (again)

2015-08-25 Thread Andries E. Brouwer
On Mon, Aug 24, 2015 at 03:44:09PM +0200, Tim Ruehsen wrote:

 Just implemented (or let's say fixed) Content-Disposition in wget2. It now
 saves the file as
 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Good!

 Content-Disposition (filename, filename*) is standardized, but browsers seem
 to behave/parse very differently, ignoring standards.

Yes. On the web a general phenomenon is that non-specialists create websites.
They know nothing about standards, but fiddle until it works (say, with IE6).
Also Microsoft does/did not respect standards.

A consequence is that practice is more important than theory.
One has to try for robust solutions.

  I prefer to base the decision about what to do on the form
  of the filename (ASCII / UTF-8 / other), not on the
  headers encountered on the way to this file.
 
 I guess we can find an easy agreement.
 
 1. Wget has to obey the defaults. If it fails or we find a well-known 
 misbehavior (server/document fault), handle it automatically.
 That's how we try to do it now.
 
 2. If a problem still arises, the user should be able to intervene, using
 special command line options for fine-tuning Wget's behavior.

Yes, whatever the user says, we do; the case where options have been given
is unproblematic.

That leaves your point 1. I am not sure what you think the defaults are.

My basic example is the %-encoded pure ASCII url, referring to a non-text
object. How should wget save the object? There is zero charset information.
My answer today (after conversation with Eli) is:
Decode the %-encoded string. The last part is the suggested filename.
If it is ASCII, use that ASCII name (where valid for the OS).
If it is UTF-8 (but not ASCII), use it when the locale is UTF-8,
otherwise convert (if possible) or escape.  If it is not UTF-8, escape.

[That is, I would recognize only what is easy to recognize,
and prefer not to rely on any headers. Prefer not to convert
except possibly in the UTF-8 case.]
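
For concreteness, a minimal C sketch of the decoding step of that rule
(an illustration only, not wget code; utf8_valid() is the bit-pattern
test sketched further down in this thread):

  #include <ctype.h>
  #include <stdlib.h>

  /* Decode %XX escapes in place; returns the decoded length.  The
     result is a byte string that may turn out to be ASCII, UTF-8,
     or anything else - classification happens afterwards.  */
  static size_t
  percent_decode (char *s)
  {
    char *src = s, *dst = s;

    while (*src)
      {
        if (src[0] == '%' && isxdigit ((unsigned char) src[1])
            && isxdigit ((unsigned char) src[2]))
          {
            char hex[3] = { src[1], src[2], 0 };
            *dst++ = (char) strtol (hex, NULL, 16);
            src += 3;
          }
        else
          *dst++ = *src++;
      }
    *dst = 0;
    return (size_t) (dst - s);
  }

If every decoded byte is printable ASCII, keep the name; if utf8_valid()
accepts it and the locale is UTF-8, keep it too; otherwise convert
(if possible) or escape, as described above.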

How does your answer differ?
Some ancient docs say that ISO-8859-1 is a default. Even if such docs
have not yet been replaced, I feel that they no longer reflect current
practice. ISO-8859-x is dying. All the web should converge to Unicode,
whatever that may be.

The relevant example might be
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
I have the impression that you are happy with knäckebröd.jpg,
but I would be unhappy with that (although it happens to be correct),
since guessing and conversion are involved.
Guessing may not be so bad, but guessing and converting is terrible:
it can be really complicated to retrieve the original filename
after an incorrect conversion.

Andries


Another URL:
http://hongaarskinderplezier.eu/index.php?pagina=96&naam=Gy%25F5r-Moson-Sopron
This is about holidays near the beautiful city Győr in Hungary.
But what happened with the city? Its name was written in ISO-8859-2,
using 0xf5, and that was %-escaped to %f5, and that was again
%-escaped to %25f5.

I would undo the %-escape and see pure ASCII, and save as
index.php?pagina=96&naam=Gy%F5r-Moson-Sopron.
What would you do?
The page has <meta charset=ISO-8859-2 />
The headers have Content-Type: text/html without charset information.

---

Similarly http://www.matklubben.se/recept/lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l
has the %-encoded version of lchf kn%e4ckebr%f6d mandelmj%f6l,
which in turn is the %-encoded ISO-8859-1 version of lchf knäckebröd mandelmjöl.

Such double encodings are not uncommon.
But as a first approximation I think wget should not try to recognize them.

---

http://www.eet-china.com/SEARCH/ART/%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1%E7%9A%84%EF%BC%85C0.HTM
ends in %C0%B6的%D1的%C0.HTM - this is a %-encoding using full-width %-signs (U+FF05).

You see that one can encounter all levels of messiness.





Re: [Bug-wget] bad filenames (again)

2015-08-24 Thread Tim Ruehsen
On Saturday 22 August 2015 00:39:01 Andries E. Brouwer wrote:
 On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:
   Content-Disposition: attachment;
   filename=20101202_%EB...%A8-%EB%B0%B1_.sgf
   This encodes a valid utf-8 filename, and that name should be used.
   So wget should save this file under the name
   20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
  
  This is a different issue. Here we are talking about the encoding of HTTP
  headers, especially 'filename' values within Content-Disposition HTTP
  header. Wget simply does not parse this correctly - it is just not coded
  in. It is just Wget missing some code here (worth opening a separate
  bug).
 Good, saved for later.

Just implemented (or let's say fixed) Content-Disposition in wget2. It now 
saves the file as
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Content-Disposition (filename, filename*) is standardized, but browsers seem
to behave/parse very differently, ignoring standards.
See 
http://stackoverflow.com/questions/93551/how-to-encode-the-filename-parameter-of-content-disposition-header-in-http
(answer 2 from Martin Ørding-Thomsen)

But that's just FYI. Different issue.


  If the server AND the document do not explicitly specify the character
  encoding, there still is one - namely the default. It was ISO-8859-1
  a while ago. AFAIR, HTML5 might have changed that (too late for me now
  to look it up).
 
 Yes - that is our main difference. You read the standard and find there
 what everyone is supposed to do, or what the default is.
 I download stuff from the net and encounter lots of things people do,
 that are perhaps not according to the most recent standard,
 and may differ from the default.
 
 As a consequence I prefer to base the decision about what to do
 on the form of the filename (ASCII / UTF-8 / other), not on the
 headers encountered on the way to this file.

I guess we can find an easy agreement.

1. Wget has to obey the defaults. If it fails or we find a well-known 
misbehavior (server/document fault), handle it automatically.
That's how we try to do it now.

2. If a problem still arises, the user should be able to intervene, using
special command line options for fine-tuning Wget's behavior.

Of course we try our best, so that 2. is normally not necessary.

You already gave some examples, one of them (the Content-Disposition example)
already led to an optimization (I'll transfer the code to Wget1.x soon).
The other two obeyed the standards (one had f*cked up content, but that didn't 
touch Wget's functionality).

I would ask you to give more examples of websites that you think aren't
standards-compliant and/or where Wget has problems parsing out the links.
That would be 50% of the work.

 (By the way, I checked my conjecture that iconv from UTF-8
 to UTF-8 need not be the identity map, and that is indeed the case.
 On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

We should have a 'shortcut', so if to-charset and from-charset are the same,
we don't convert (see the sketch below).
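
A sketch of such a shortcut (hypothetical helper name; charset names as
accepted by iconv_open):

  #include <strings.h>

  static int
  need_conversion (const char *from_charset, const char *to_charset)
  {
    /* Identical charsets: skip iconv entirely, so that e.g. UTF-8
       input is not accidentally NFC/NFD-normalized (as noted above).  */
    return strcasecmp (from_charset, to_charset) != 0;
  }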

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-23 Thread Eli Zaretskii
 Date: Sun, 23 Aug 2015 17:16:37 +0200
 From: Ángel González keis...@gmail.com
 CC: bug-wget@gnu.org
 
 On 23/08/15 16:47, Eli Zaretskii wrote:
  Wrong. I can work with a larger one by using a UNC path.
  But then you will be unable to use relative file names, and will have
  to convert all the file names to the UNC format by hand, and any file
  names we create that exceed the 260-character limit will be almost
  unusable, since almost any program will be unable to
  read/write/delete/copy/whatever it.  So this method is impractical,
  and it doesn't lift the limit anyway, see below.
 {{reference needed}}

For what part do you need a reference?

 I'm quite sure explorer will happily work with UNC paths, which means
 the user will be able to flawlessly move/copy/delete them.

No, the Explorer cannot handle file names longer than 260 characters.  The
Explorer uses shell APIs that are limited to 260 characters.

Like I said: creating files whose names are longer than 260 characters
is asking for trouble.  You will need to write your own programs to
manipulate such files.

 And actually, I think most programs will happily open (and read,
 edit, etc.) a file that was provided in UNC format.

UNC format is indeed supported by most (if not all) programs, but as
soon as the file name is longer than 260 characters, all file-related
APIs begin to fail.

  * _Some_ Windows when using _some_ filesystems / apis have fixed limits,
  but there are ways to produce larger paths...
  The issue here is not whether the size limits differ, the issue is
  whether the largest limit is still fixed.  And it is, on Windows.
  I had tried to skip over the specific details in my previous mail.
  I didn't mean that the limit would be bigger, but that there isn't
  one (that you can rely on, at least). On Windows 95/98 you had this
  260 character limit, and you currently still do, depending on the
  API you are using. But that's not a system limit any more.
  This is wrong, and the URL I posted clearly describes the limitation:
  If you use UNCs, the size is still limited to 32K characters.  So even
  if we want to convert every file name to the UNC \\?\x:\foo\bar form
  and create unusable files (which I don't recommend), the maximum
  length is still known in advance.
 Ok, it is possible that there *is* a limit of 32K characters.
 Still, it's not a practical one to hardcode.

Why not?  Here's a simple code snippet that should work:

  #include <windows.h>   /* MultiByteToWideChar, GetLastError */
  #include <errno.h>
  #include <io.h>        /* _wopen */

  int
  open_utf8 (const char *fn, int mode)
  {
    wchar_t fn_utf16[32*1024];
    /* Convert the UTF-8 name to UTF-16; MB_ERR_INVALID_CHARS makes
       this fail if FN is not valid UTF-8.  */
    int result = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS, fn, -1,
                                      fn_utf16, 32*1024);

    if (!result)
      {
        DWORD err = GetLastError ();

        switch (err)
          {
          case ERROR_INVALID_FLAGS:
          case ERROR_INVALID_PARAMETER:
            errno = EINVAL;
            break;
          case ERROR_INSUFFICIENT_BUFFER:
            errno = ENAMETOOLONG;
            break;
          case ERROR_NO_UNICODE_TRANSLATION:
          default:
            errno = ENOENT;
            break;
          }
        return -1;
      }
    return _wopen (fn_utf16, mode);
  }

 And we would be risking a stack overflow if attempting to create
 such a buffer on the stack.

The default stack size of Windows programs is 2MB, so I think we are
safe using 64K here.




Re: [Bug-wget] bad filenames (again)

2015-08-23 Thread Ángel González

On 20/08/15 04:42, Eli Zaretskii wrote:

From: Ángel González wrote:

On 19/08/15 16:38, Eli Zaretskii wrote:

Indeed.  Actually, there's no need to allocate memory dynamically,
neither with malloc nor with alloca, since Windows file names have
fixed size limitation that is known in advance.  So each conversion
function can use a fixed-sized local wchar_t array.  Doing that will
also avoid the need for 2 calls to MultiByteToWideChar, the first one
to find out how much space to allocate.

Nope. These functions would receive full path names, so there's no
maximum length.*

Please see the URL I mentioned earlier in this thread: _all_ Windows
file-related APIs are limited to 260 characters, including the drive
letter and all the leading directories.

Wrong. I can work with a larger one by using a UNC path.


* _Some_ Windows when using _some_ filesystems / apis have fixed limits,
but there are ways to produce larger paths...

The issue here is not whether the size limits differ, the issue is
whether the largest limit is still fixed.  And it is, on Windows.
I had tried to skip over the specific details in my previous mail.
I didn't mean that the limit would be bigger, but that there isn't
one (that you can rely on, at least). On Windows 95/98 you had this
260 character limit, and you currently still do, depending on the
API you are using. But that's not a system limit any more.







Re: [Bug-wget] bad filenames (again)

2015-08-23 Thread Eli Zaretskii
 Date: Sun, 23 Aug 2015 16:15:04 +0200
 From: Ángel González keis...@gmail.com
 CC: bug-wget@gnu.org
 
 On 20/08/15 04:42, Eli Zaretskii wrote:
  From: Ángel González wrote:
 
  On 19/08/15 16:38, Eli Zaretskii wrote:
  Indeed.  Actually, there's no need to allocate memory dynamically,
  neither with malloc nor with alloca, since Windows file names have
  fixed size limitation that is known in advance.  So each conversion
  function can use a fixed-sized local wchar_t array.  Doing that will
  also avoid the need for 2 calls to MultiByteToWideChar, the first one
  to find out how much space to allocate.
  Nope. These functions would receive full path names, so there's no
  maximum length.*
  Please see the URL I mentioned earlier in this thread: _all_ Windows
  file-related APIs are limited to 260 characters, including the drive
  letter and all the leading directories.
 Wrong. I can work with a larger one by using a UNC path.

But then you will be unable to use relative file names, and will have
to convert all the file names to the UNC format by hand, and any file
names we create that exceed the 260-character limit will be almost
unusable, since almost any program will be unable to
read/write/delete/copy/whatever it.  So this method is impractical,
and it doesn't lift the limit anyway, see below.

  * _Some_ Windows when using _some_ filesystems / apis have fixed limits,
  but there are ways to produce larger paths...
  The issue here is not whether the size limits differ, the issue is
  whether the largest limit is still fixed.  And it is, on Windows.
 I had tried to skip over the specific details in my previous mail.
 I didn't mean that the limit would be bigger, but that there isn't
 one (that you can rely on, at least). On Windows 95/98 you had this
 260 character limit, and you currently still do, depending on the
 API you are using. But that's not a system limit any more.

This is wrong, and the URL I posted clearly describes the limitation:
If you use UNCs, the size is still limited to 32K characters.  So even
if we want to convert every file name to the UNC \\?\x:\foo\bar form
and create unusable files (which I don't recommend), the maximum
length is still known in advance.
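
For reference, a sketch of producing that form (illustration only; the
\\?\ prefix is honored by the wide-character APIs and requires an
absolute path using backslashes):

  #include <stdio.h>

  /* Turn "x:\foo\bar" into "\\?\x:\foo\bar" for CreateFileW etc.  */
  static int
  unc_form (const char *abs_path, char *out, size_t outlen)
  {
    return snprintf (out, outlen, "\\\\?\\%s", abs_path);
  }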




Re: [Bug-wget] bad filenames (again)

2015-08-23 Thread Ángel González

On 23/08/15 16:47, Eli Zaretskii wrote:

Wrong. I can work with a larger one by using a UNC path.

But then you will be unable to use relative file names, and will have
to convert all the file names to the UNC format by hand, and any file
names we create that exceed the 260-character limit will be almost
unusable, since almost any program will be unable to
read/write/delete/copy/whatever it.  So this method is impractical,
and it doesn't lift the limit anyway, see below.

{{reference needed}}

I'm quite sure explorer will happily work with UNC paths, which means
the user will be able to flawlessly move/copy/delete them. And actually,
I think most programs will happily open (and read, edit, etc.) a file that
was provided in UNC format.



* _Some_ Windows when using _some_ filesystems / apis have fixed limits,
but there are ways to produce larger paths...

The issue here is not whether the size limits differ, the issue is
whether the largest limit is still fixed.  And it is, on Windows.

I had tried to skip over the specific details in my previous mail.
I didn't mean that the limit would be bigger, but that there isn't
one (that you can rely on, at least). On Windows 95/98 you had this
260 character limit, and you currently still do, depending on the
API you are using. But that's not a system limit any more.

This is wrong, and the URL I posted clearly describes the limitation:
If you use UNCs, the size is still limited to 32K characters.  So even
if we want to convert every file name to the UNC \\?\x:\foo\bar form
and create unusable files (which I don't recommend), the maximum
length is still known in advance.
Ok, it is possible that there *is* a limit of 32K characters. Still,
it's not a practical one to hardcode. And we would be risking a stack
overflow if attempting to create such a buffer on the stack.






Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Andries E. Brouwer
On Fri, Aug 21, 2015 at 12:07:56PM +0200, Tim Ruehsen wrote:

 The charset is *not* determined (guessed) from the URL string, be it hex 
 encoded or not. We take the locale setup as default, but it can be overridden 
 by --local-encoding. Right now, Wget does not have the ability to have 
 different encodings for file input (--input-file) and input via STDIN (when 
 used at the same time). But that is another issue...

It seems to me that I keep saying the same thing. We are not communicating.
You talk about locale and local-encoding but that is not the point.

There is a remote site.
Nothing is known about this remote site.
Certainly there is no reason to assume that it uses a character set
that is related to the local setup of the machine here that runs wget.

Since nothing is known about this remote site, it is impossible
to know the character set (if any) of the filenames. And hence
it is impossible to invoke iconv, since iconv requires a
from-charset and a to-charset.

Also the user does not know yet what character set this remote site
is using. And it might use more than one. So the user cannot in general
give a --from-charset option.

In this situation: what do you do?

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Ruehsen
On Friday 21 August 2015 13:00:34 Andries E. Brouwer wrote:
 On Fri, Aug 21, 2015 at 12:07:56PM +0200, Tim Ruehsen wrote:
  The charset is *not* determined (guessed) from the URL string, be it hex
  encoded or not. We take the locale setup as default, but it can be
  overridden by --local-encoding. Right now, Wget does not have the ability
  to have different encodings for file input (--input-file) and input via
  STDIN (when used at the same time). But that is another issue...
 
 It seems to me that I keep saying the same thing. We are not communicating.
Yes, I am also under this impression :-(

 You talk about locale and local-encoding but that is not the point.
Sorry, exactly that seems to be the point.

 There is a remote site.
 Nothing is known about this remote site.
Wrong. Regarding HTTP(S), we know exactly the encoding of each downloaded HTML
and CSS document (that's what I call 'remote encoding'). It is only these types
of (downloaded) files that we scan when going recursive.
If the server (or document) states a wrong encoding (e.g. *saying* it has 
Japanese/EUC-JP encoding, but in fact it is iso-8859-1 encoded), we either 
have to use escaping or the user uses a --remote-encoding to override the 
wrong server/document statement.

But leaving aside these misconfigured servers as a special case, we are fine.

You might take a look at http://www.w3.org/TR/html4/charset.html#h-5.2.2 which 
describes how servers and clients should work regarding HTML character 
encoding (there should be something for CSS as well out there).

Andries, if you still have the impression that we are not communicating, I 
suggest that you make up a simple example test case to show your problem (and 
excuse me please for being kinda dumb/blind). Maybe two small HTML files with 
references to each other to demonstrate your point. (I can put them on my 
server and start wget/wget2 on it to see if it works or not).

Regards, Tim




Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Andries E. Brouwer
On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:

  There is a remote site.
  Nothing is known about this remote site.

 Wrong. Regarding HTTP(S), we exactly know the encoding
 of each downloaded HTML and CSS document
 (that's what I call 'remote encoding').

You are an optimist. In my experience Firefox rarely gets it right.
Let me find some random site. Say
http://web2go.board19.com/gopro/go_view.php?id=12345

If I go there with Firefox, I get a go board with a lot of mojibake
around it. Firefox took the encoding to be Unicode. Trying out the
options in the Text encoding menu, it turns out to be
Chinese, Traditional.

 Leaving aside these misconfigured servers as a special case

But most of the East Asian servers I meet are misconfigured in this way.
They announce text/html with charset utf-8 and come with some random
charset.
So trusting this announced charset should be done cautiously.

And you say misconfigured servers, but often one gets a
Unix or Windows file hierarchy, and several character sets occur.
The server doesn't know. The sysadmin doesn't know. A university
machine will have many users with files in several languages
and character sets.

Moreover, the character set of a filename is in general unrelated
to the character set of the contents of the file. That is most clear
when the file is not a text file. What character set is the filename

http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

in? You recognize ISO 8859-1 or similar. My local machine is on UTF-8.
The HTTP headers say Content-Type: image/jpeg.
How can wget guess?

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Ruehsen
On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
 On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:
   There is a remote site.
   Nothing is known about this remote site.
  
  Wrong. Regarding HTTP(S), we exactly know the encoding
  of each downloaded HTML and CSS document
  (that's what I call 'remote encoding').
 
 You are an optimist. In my experience Firefox rarely gets it right.
 Let me find some random site. Say
 http://web2go.board19.com/gopro/go_view.php?id=12345

I try to be an optimist in all situations, yes :-)

 If I go there with Firefox, I get a go board with a lot of mojibake
 around it. Firefox took the encoding to be Unicode. Trying out what
 I have to say in the Text encoding menu, it turns out to be
 Chinese, Traditional.

The server tells us the document is UTF-8.
The document tells us it is UTF-8.
But then, some moron (there are a lot of these dudes doing webpage 'design')
put non-UTF-8 text into the document.
That is like putting plum pudding into a jar labeled 'strawberry jam'. What
will you do? Go back and return it? Or accept it, saying 'uh oh, my
strawberry allergy will bite me, but I am a tough guy'.

*BUT* that is not the point for wget, since wget doesn't mess around with the
textual content (no conversion takes place). When used recursively, wget will
extract URLs from the document. *NOT* from the text but from the HTML
tags/attributes. And *surprise*, all of the links in the document are UTF-8 /
ASCII (otherwise not a single browser in the world could follow them).
And all that matters are the URLs from the HTML attributes.

 And you say misconfigured servers, but often one gets a
 Unix or Windows file hierarchy, and several character sets occur.
 The server doesn't know. The sysadmin doesn't know. A university
 machine will have many users with files in several languages
 and character sets.

Trust them, they know. If not, their web site will be heavily broken.
But there is nothing to fix for us.

 Moreover, the character set of a filename is in general unrelated
 to the character set of the contents of the file. That is most clear
 when the file is not a text file. What character set is the filename
 
 http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file. Content doesn't matter to wget.

Apart from that, if you want to download the above-mentioned web page and
you have a UTF-8 locale, you have to tell wget via --local-encoding what
encoding the URL is. But if wget --recursive finds the above URL within an HTML
attribute, you won't need --local-encoding. By the measures taken from
http://www.w3.org/TR/html4/charset.html#h-5.2.2, wget will know the correct
encoding and will just do the right thing (after the currently discussed
change regarding charsets / file naming). Wget2 already does it.


$ wget --local-encoding=iso-8859-1 
'http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg'
--2015-08-21 16:30:05--  
http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg
Resolving www.win.tue.nl (www.win.tue.nl)... 131.155.0.177
Connecting to www.win.tue.nl (www.win.tue.nl)|131.155.0.177|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-21 16:30:05 ERROR 404: Not Found.

--2015-08-21 16:30:05--  
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
Reusing existing connection to www.win.tue.nl:80.
HTTP request sent, awaiting response... 200 OK
Length: 11690 (11K) [image/jpeg]
Saving to: ‘knäckebröd.jpg’

knäckebröd.jp   
100%[=]
  
11.42K  --.-KB/s   in 0.002s 

2015-08-21 16:30:05 (6.83 MB/s) - ‘knäckebröd.jpg’ saved [11690/11690]


(Old wget having the progress bar bug.)


Tim




Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Andries E. Brouwer
On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
 On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:

  Let me find some random site. Say
  http://web2go.board19.com/gopro/go_view.php?id=12345

 The server tells us the document is UTF-8.
 The document tells us it is UTF-8.

And it is not. So - this example establishes that remote character set
information, when present, is often unreliable.

Let me add one more example, 

http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html

a famous Danish recipe. The headers say Content-Type: text/html
without revealing any character set.

  Moreover, the character set of a filename is in general unrelated
  to the character set of the contents of the file. That is most clear
  when the file is not a text file. What character set is the filename
  
  http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
 
 Wrong question. It is a JPEG file. Content doesn't matter to wget.

Hmm. I thought the topic of our discussion was filenames and character sets.
Here is a file, and its name is in ISO 8859-1.
When wget saves it, what will the filename be?

 If you want to download the above mentioned web page and 
 you have a UTF-8 locale, you have to tell wget via --local-encoding what 
 encoding the URL is.

Are you sure you do not mean --remote-encoding?

But whatever you mean, it is an additional option.
If the wget user already knows the character set, she can of course tell wget.

The discussion is about the situation where the user does not know.

So, that is the situation we are discussing: a remote site, the user
does not know what encoding is used (she will find out after downloading),
and the headers have either no information or wrong information.
Now if one invokes iconv it is likely that garbage will be the result.

Andries


Here is a Korean example.
http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
The http headers say Content-Type: text/plain; charset=iso-8859-1
(which is incorrect), an internal header says that this is ISO-2022-KR
(which is also incorrect), in fact the content is in EUC-KR.
That is none of wget's business, we want to save this file.
The headers say
Content-Disposition: attachment; 
filename=20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf
This encodes a valid utf-8 filename, and that name should be used.
So wget should save this file under the name
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf



Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Rühsen
On Friday 21 August 2015 17:28:09 Andries E. Brouwer wrote:
 On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
  On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
   Let me find some random site. Say
   http://web2go.board19.com/gopro/go_view.php?id=12345
  
  The server tells us the document is UTF-8.
  The document tells us it is UTF-8.
 
 And it is not. So - this example establishes that remote character set
 information, when present, is often unreliable.
 
 Let me add one more example,
 
 http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html
 
 a famous Danish recipe. The headers say Content-Type: text/html
 without revealing any character set.

1. There is no URL to parse in this document, so encoding does not matter 
anyway.

2. If the server AND the document do not explicitly specify the character 
encoding, there still is one - namely the default. It was ISO-8859-1 a while
ago. AFAIR, HTML5 might have changed that (too late for me now to look it up).

There is a good diagram - maybe not perfectly up to date, but it still shows
roughly how to operate:
http://nikitathespider.com/articles/EncodingDivination.html

 
   Moreover, the character set of a filename is in general unrelated
   to the character set of the contents of the file. That is most clear
   when the file is not a text file. What character set is the filename
   
   http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
  
  Wrong question. It is a JPEG file. Content doesn't matter to wget.
 
 Hmm. I thought the topic of our discussion was filenames and character sets.
 Here is a file, and its name is in ISO 8859-1.
 When wget saves it, what will the filename be?
 
  If you want to download the above mentioned web page and
  you have a UTF-8 locale, you have to tell wget via --local-encoding what
  encoding the URL is.
 
 Are you sure you do not mean --remote-encoding?

Yes, I am sure. Here are my tests (my locale is UTF-8):

Wrong:
$ wget -nv --remote-encoding=iso-8859-1 
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
2015-08-21 20:09:30 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg 
[11690/11690] - kn�ckebr�d.jpg.1 [1]

Right:
http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg:
2015-08-21 20:14:18 FEHLER 404: Not Found.
2015-08-21 20:14:18 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg 
[11690/11690] - knäckebröd.jpg [1]


 But whatever you mean, it is an additional option.
 If the wget user already knows the character set, she can of course tell
 wget.
 
 The discussion is about the situation where the user does not know.
 
 So, that is the situation we are discussing: a remote site, the user
 does not know what encoding is used (she will find out after downloading),
 and the headers have either no information or wrong information.
 Now if one invokes iconv it is likely that garbage will be the result.


 Here is a Korean example.
 http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
 The http headers say Content-Type: text/plain; charset=iso-8859-1
 (which is incorrect), an internal header says that this is ISO-2022-KR
 (which is also incorrect), in fact the content is in EUC-KR.
 That is none of wget's business, we want to save this file.
 The headers say
 Content-Disposition: attachment;
 filename=20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%
 EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%E
 B%B0%B1_.sgf This encodes a valid utf-8 filename, and that name should be
 used. So wget should save this file under the name
 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

This is a different issue. Here we are talking about the encoding of HTTP 
headers, especially 'filename' values within Content-Disposition HTTP header.
The above is correctly encoded (UTF-8 percent encoding).

The encoding is described in RFC5987 (Character Set and Language Encoding for
 Hypertext Transfer Protocol (HTTP) Header Field Parameters).

Wget simply does not parse this correctly - it is just not coded in.
That is why support for Content-Disposition in Wget is documented as 
'experimental' (you have to explicitly enable it via --content-disposition).

Again the server encoding is known. Regarding filename encoding, nothing is 
wrong in your example. It is just Wget missing some code here (worth opening a 
separate bug).
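
For illustration: a filename* value has the RFC 5987 shape
charset''value (an optional language tag sits between the quotes).
A minimal parsing sketch (hypothetical helper, not Wget's code),
reusing the percent_decode() sketched earlier in this thread:

  #include <string.h>

  size_t percent_decode (char *s);   /* as sketched earlier */

  /* Split "UTF-8''20101202_%EB...sgf" into charset and decoded value.
     Returns 0 on success, -1 if the charset''value shape is missing.  */
  static int
  parse_ext_value (char *ext, char **charset, char **value)
  {
    char *q1 = strchr (ext, '\'');
    char *q2 = q1 ? strchr (q1 + 1, '\'') : NULL;

    if (!q2)
      return -1;
    *q1 = '\0';
    *charset = ext;            /* e.g. "UTF-8"; q1+1..q2 is the language */
    *value = q2 + 1;
    percent_decode (*value);   /* %XX back to raw bytes of that charset */
    return 0;
  }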


Default Wget behavior:
$ wget -nv http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
2015-08-21 20:20:05 
URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] - 
1847B5314CF754B83134B7 [1]


Enabled Content-Disposition support:
$ wget -nv --content-disposition 
http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
2015-08-21 20:23:50 
URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] - 
20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf
 
[1]

As we see, unescaping and UTF-8-to-locale conversion are missing.

Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Andries E. Brouwer
On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:

  Content-Disposition: attachment;
  filename=20101202_%EB...%A8-%EB%B0%B1_.sgf
  This encodes a valid utf-8 filename, and that name should be used.
  So wget should save this file under the name
  20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
 
 This is a different issue. Here we are talking about the encoding of HTTP 
 headers, especially 'filename' values within Content-Disposition HTTP header.
 Wget simply does not parse this correctly - it is just not coded in.
 It is just Wget missing some code here (worth opening a separate bug).

Good, saved for later.

 If the server AND the document do not explicitly specify the character 
 encoding, there still is one - namely the default. It was ISO-8859-1
 a while ago. AFAIR, HTML5 might have changed that (too late for me now
 to look it up).

Yes - that is our main difference. You read the standard and find there
what everyone is supposed to do, or what the default is.
I download stuff from the net and encounter lots of things people do,
that are perhaps not according to the most recent standard,
and may differ from the default.

As a consequence I prefer to base the decision about what to do
on the form of the filename (ASCII / UTF-8 / other), not on the
headers encountered on the way to this file.

Fortunately, almost all URLs are in ASCII - no problem.
Fortunately, almost all that are not in ASCII, are UTF-8.
The good thing about UTF-8 is that it has a quite typical bit pattern.
A non-ASCII filename that is valid UTF-8 is very likely UTF-8.
So, one can recognize ASCII and UTF-8 rather reliably.
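
A sketch of that recognition (illustration only; rejecting overlong
forms and surrogates is what makes false positives so unlikely):

  #include <stddef.h>

  /* Return 1 if s[0..len) is well-formed UTF-8.  */
  static int
  utf8_valid (const unsigned char *s, size_t len)
  {
    size_t i = 0, k;

    while (i < len)
      {
        unsigned char c = s[i];
        size_t n;
        unsigned cp;

        if (c < 0x80)                      /* ASCII byte */
          { i++; continue; }
        else if ((c & 0xe0) == 0xc0) { n = 1; cp = c & 0x1f; }
        else if ((c & 0xf0) == 0xe0) { n = 2; cp = c & 0x0f; }
        else if ((c & 0xf8) == 0xf0) { n = 3; cp = c & 0x07; }
        else
          return 0;                        /* stray continuation byte etc. */

        if (i + n >= len)
          return 0;                        /* truncated sequence */
        for (k = 1; k <= n; k++)
          {
            if ((s[i + k] & 0xc0) != 0x80)
              return 0;                    /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3f);
          }
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800)
            || (n == 3 && cp < 0x10000)    /* overlong encoding */
            || (cp >= 0xd800 && cp <= 0xdfff) || cp > 0x10ffff)
          return 0;
        i += n + 1;
      }
    return 1;
  }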

(By the way, I checked my conjecture that iconv from UTF-8
to UTF-8 need not be the identity map, and that is indeed the case.
On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-21 Thread Tim Ruehsen
On Friday 21 August 2015 02:08:43 Andries E. Brouwer wrote:
 On Thu, Aug 20, 2015 at 10:47:35AM +0200, Tim Ruehsen wrote:
  Basically, I keep track of the charset of each URL input
  (command line, input file, stdin, downloaded+scanned).
 
 It seems to me, you can't. Consider for example a command line
 that gives a URL hex escaped. Now the command line is pure ASCII
 and gives no information at all about the character set of the filename.

The charset is *not* determined (guessed) from the URL string, be it hex 
encoded or not. We take the locale setup as default, but it can be overridden 
by --local-encoding. Right now, Wget does not have the ability to have 
different encodings for file input (--input-file) and input via STDIN (when 
used at the same time). But that is another issue...

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Andries E. Brouwer
On Wed, Aug 19, 2015 at 05:38:39PM +0300, Eli Zaretskii wrote:

  Assign a character set as follows:
  - if the user specified a from-charset, use that
  - if the name is printable ASCII (in 0x20-0x7f), take ASCII
  - if the name is non-ASCII and valid UTF-8, take UTF-8
  - otherwise take Unknown.
 
 I think this is simpler and produces the same results:
  - if the user specified a from-charset, use that
  - otherwise assume UTF-8

Simpler, but the results are not the same.

If the from-charset is unknown, then any call of iconv will certainly
lead to bad results. So there are only the two possibilities:
(i) leave as-is (if that is the user's preference)
(ii) make pure ASCII via hex escapes.
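
A sketch of possibility (ii) (illustration only, hypothetical helper
name; '%' itself is escaped too, so the result stays reversible):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Derive a pure-ASCII name by %-escaping every byte outside
     printable ASCII.  Caller frees the result.  */
  static char *
  hex_escape_name (const unsigned char *name)
  {
    size_t i, len = strlen ((const char *) name);
    char *out = malloc (3 * len + 1);   /* worst case: all bytes escaped */
    char *p = out;

    if (!out)
      return NULL;
    for (i = 0; i < len; i++)
      {
        if (name[i] >= 0x20 && name[i] <= 0x7e && name[i] != '%')
          *p++ = (char) name[i];
        else
          p += sprintf (p, "%%%02X", name[i]);
      }
    *p = 0;
    return out;
  }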

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Eli Zaretskii
 From: Tim Ruehsen tim.rueh...@gmx.de
 Cc: Andries E. Brouwer andries.brou...@cwi.nl
 Date: Thu, 20 Aug 2015 10:47:35 +0200
 
  Tim says he has some/most of that coded on a branch, so I think we
  should start by merging that branch, and then take it from there.
 
 It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 
 'click on the merge button' to merge.
 Basically, I keep track of the charset of each URL input (command line, input 
 file, stdin, downloaded+scanned). So when generating the filename we have the 
 to and from charset. When iconv fails here (e.g. Chinese input, ASCII
 output), escaping takes place.

Sounds good to me.  Is anything holding up the merge of this to master?



Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Andries E. Brouwer
On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote:

 OK, but how is this different from what we'd get using your suggested
 4 alternatives?

What can I reply? Just read my letter again.
I think I said what I wanted to say.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Tim Ruehsen
On Thursday 20 August 2015 17:39:09 Eli Zaretskii wrote:
  From: Tim Ruehsen tim.rueh...@gmx.de
  Cc: Andries E. Brouwer andries.brou...@cwi.nl
  Date: Thu, 20 Aug 2015 10:47:35 +0200
  
   Tim says he has some/most of that coded on a branch, so I think we
   should start by merging that branch, and then take it from there.
  
  It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can
  just 'click on the merge button' to merge.
  Basically, I keep track of the charset of each URL input (command line,
  input file, stdin, downloaded+scanned). So when generating the filename
  we have the to and from charset. When iconv fails here (e.g. Chinese
  input, ASCII output), escaping takes place.
 
 Sounds good to me.  Is anything holding up the merge of this to master?

Sorry, it should have been: so you *can't* just 'click on the merge button' to
merge :-) I have to do some more organizational stuff over there before I
introduce an official alpha version (but it is working already with a bunch of 
new features).

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Andries E. Brouwer
On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

 OK, then let me explain my line of reasoning.  Plain ASCII is valid
 UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
 know it's not valid UTF-8.  So the last 3 possibilities in your
 suggestion boil down to try converting as if it were UTF-8, and if
 that fails, you know it's Unknown.

Yes, although I would not invoke iconv to actually convert from UTF-8 to
UTF-8. Unicode is a complicated beast, and it is not certain that
conversion from UTF-8 to UTF-8 is the identity transformation.
(For example, implementations may prefer either NFC or NFD.
MacOS has its own NFD-like version for filenames.)
But you are right, one can use it as test.

After finding out that the charset is unknown I want to hex-encode
the entire filename. On the other hand, if the appropriate thing
is to invoke iconv to convert from one charset to another, I want
to hex-encode only the failing bytes.

This difference is because (a) if there is reason to expect that
conversion should be possible, for example because the user
specified the from-charset as GB18030, and it fails, then it often
fails only in a few isolated places where Microsoft extensions are used,
and it is more user-friendly to do the conversion where possible;
but (b) if nothing is known, then the character set can be a
multibyte one like SJIS where ASCII bytes occur as second halves
of symbols, and not escaping such ASCII bytes is confusing
and sometimes leads to strange problems.
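
A sketch of case (a), escaping only the failing bytes (illustration
only; real code would also have to grow the output buffer on E2BIG):

  #include <errno.h>
  #include <iconv.h>
  #include <stdio.h>

  /* Convert IN from charset FROM to charset TO into OUT; bytes that
     cannot be converted are emitted as %XX escapes instead.  */
  static int
  convert_or_escape (const char *from, const char *to,
                     char *in, size_t inlen, char *out, size_t outlen)
  {
    iconv_t cd = iconv_open (to, from);
    char *inp = in, *outp = out;
    size_t inb = inlen, outb = outlen - 1;

    if (cd == (iconv_t) -1)
      return -1;                 /* unknown charset pair */
    while (inb > 0)
      {
        if (iconv (cd, &inp, &inb, &outp, &outb) != (size_t) -1)
          break;                 /* all remaining input converted */
        if (errno == EILSEQ && outb >= 3)
          {
            /* %-escape the offending byte, then resume after it.  */
            snprintf (outp, 4, "%%%02X", (unsigned char) *inp);
            outp += 3; outb -= 3;
            inp++; inb--;
          }
        else
          break;                 /* out of output space, etc. */
      }
    *outp = 0;
    iconv_close (cd);
    return 0;
  }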

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Tim Ruehsen
On Wednesday 19 August 2015 17:38:39 Eli Zaretskii wrote:
  Date: Wed, 19 Aug 2015 02:52:57 +0200
  From: Andries E. Brouwer andries.brou...@cwi.nl
  Cc: bug-wget@gnu.org
  
  Look at the remote filename.
  
  Assign a character set as follows:
  - if the user specified a from-charset, use that
  - if the name is printable ASCII (in 0x20-0x7f), take ASCII
  - if the name is non-ASCII and valid UTF-8, take UTF-8
  - otherwise take Unknown.
 
 I think this is simpler and produces the same results:
  - if the user specified a from-charset, use that
  - otherwise assume UTF-8
 
  Determine a local character set as follows:
  - if the user specified a to-charset, use that
  - if the locale uses UTF-8, use that
  - otherwise take ASCII
 
 I suggest this instead:
  - if the user specified a to-charset, use that
  - otherwise, call nl_langinfo(CODESET) to find out the current
locale's encoding
 
  Convert the name from from-charset to to-charset:
  - if the user asked for unmodified filenames, do nothing
  - if the name is ASCII, do nothing
  - if the name is UTF-8 and the locale uses UTF-8, do nothing
  - convert from Unknown by hex-escaping the entire name
  - convert to ASCII by hex-escaping the entire name
  - otherwise invoke iconv(); upon failure, escape the illegal bytes
 
 My suggestion:
  - if the user asked for unmodified filenames, do nothing
  - else invoke 'iconv' to convert from remote to local encoding
  - if 'iconv' fails, convert to ASCII by hex-escaping
 
 Hex-escaping only the bytes that fail 'iconv' is better than
 hex-escaping all of them, but it's more complex, and I'm not sure it's
 worth the hassle.  But if it can be implemented without undue trouble,
 I'm all for it, as it will make wget more user-friendly in those
 cases.
 
  Once we know what we want it is trivial to write the code,
  but it may take a while to figure out what we want.
  I think we should start applying the current patch.
 
 Tim says he has some/most of that coded on a branch, so I think we
 should start by merging that branch, and then take it from there.

It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 
'click on the merge button' to merge.
Basically, I keep track of the charset of each URL input (command line, input 
file, stdin, downloaded+scanned). So when generating the filename we have the 
to and from charset. When iconv fails here (e.g. Chinese input, ASCII output), 
escaping takes place.

Tim



Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Wed, 19 Aug 2015 02:52:57 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: bug-wget@gnu.org
 
 Look at the remote filename.
 
 Assign a character set as follows:
 - if the user specified a from-charset, use that
 - if the name is printable ASCII (in 0x20-0x7f), take ASCII
 - if the name is non-ASCII and valid UTF-8, take UTF-8
 - otherwise take Unknown.

I think this is simpler and produces the same results:
 - if the user specified a from-charset, use that
 - otherwise assume UTF-8

 Determine a local character set as follows:
 - if the user specified a to-charset, use that
 - if the locale uses UTF-8, use that
 - otherwise take ASCII

I suggest this instead:
 - if the user specified a to-charset, use that
 - otherwise, call nl_langinfo(CODESET) to find out the current
   locale's encoding
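
For example (sketch):

  #include <langinfo.h>
  #include <locale.h>

  /* Returns the current locale's codeset name, e.g. "UTF-8".  */
  static const char *
  local_codeset (void)
  {
    setlocale (LC_CTYPE, "");       /* honor the user's environment */
    return nl_langinfo (CODESET);
  }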

 Convert the name from from-charset to to-charset:
 - if the user asked for unmodified filenames, do nothing
 - if the name is ASCII, do nothing
 - if the name is UTF-8 and the locale uses UTF-8, do nothing
 - convert from Unknown by hex-escaping the entire name
 - convert to ASCII by hex-escaping the entire name
 - otherwise invoke iconv(); upon failure, escape the illegal bytes

My suggestion:
 - if the user asked for unmodified filenames, do nothing
 - else invoke 'iconv' to convert from remote to local encoding
 - if 'iconv' fails, convert to ASCII by hex-escaping

Hex-escaping only the bytes that fail 'iconv' is better than
hex-escaping all of them, but it's more complex, and I'm not sure it's
worth the hassle.  But if it can be implemented without undue trouble,
I'm all for it, as it will make wget more user-friendly in those
cases.

 Once we know what we want it is trivial to write the code,
 but it may take a while to figure out what we want.
 I think we should start applying the current patch.

Tim says he has some/most of that coded on a branch, so I think we
should start by merging that branch, and then take it from there.



Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 22:28:21 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
  What is needed to have a full Unicode support in wget on Windows is to
  provide replacements for all the file-name related libc functions
  ('fopen', 'open', 'stat', 'access', etc.) which will accept file names
  encoded in UTF-8, convert them internally into UTF-16, and call the
  wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
  '_waccess', etc.) with the converted file name.  Another thing that is
  needed is similar replacements for 'printf', 'puts', 'fprintf',
  etc. when they are used for writing file names to the console --
  because we cannot write UTF-8 sequences to the Windows console.
 
 Aha. That reminds me of a patch by, I think, Aleksey Bykov.
 Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html
 
 There we had a similar discussion, and he wrote mswindows.diff with
 
 +int 
 +wc_utime (unsigned char *filename, struct _utimbuf *times)
 +{
 +  wchar_t *w_filename;
 +  int buffer_size;
 +
 +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
 +  w_filename = alloca (buffer_size);
 +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
 +  return _wutime (w_filename, times);
 +}
 
 and similar for stat, open, etc. Is something similar what would be needed
 on Windows?

Yes, thanks for pointing out those patches.  Any reasons they weren't
accepted back then?

 Is his patch usable?

It needs some minor polishing, but in general it should do the job,
yes.

I admit that I don't understand the need for the url.c patch.  Why do
we need to convert to wchar_t when the locale's codeset is already
UTF-8?  (I could understand that for non-UTF-8 locales, but the patch
explicitly limits the conversion to wchar_t and back to UTF-8 locales,
where the normal string functions should do the job.)  Is this only
for converting to upper/lower-case?

There's still the part with writing UTF-8 encoded file/URL names to
the Windows console; that will have to be added.



Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Wed, 19 Aug 2015 01:43:51 +0200
 From: Ángel González keis...@gmail.com
 
 +int
 +wc_utime (unsigned char *filename, struct _utimbuf *times)
 +{
 +  wchar_t *w_filename;
 +  int buffer_size;
 +
 +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
 +  w_filename = alloca (buffer_size);
 +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
 +  return _wutime (w_filename, times);
 +}
 
 and similar for stat, open, etc. Is something similar what would be needed
 on Windows?
 Is his patch usable? Maybe I also commented a little in
 http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
 but after that nothing happened, it seems.
 
 That would probably work, but would need a review. On a quick look, some of
 the functions have memory leaks (it seems he first used malloc, then changed
 just some of them to alloca).

Indeed.  Actually, there's no need to allocate memory dynamically,
neither with malloc nor with alloca, since Windows file names have
fixed size limitation that is known in advance.  So each conversion
function can use a fixed-sized local wchar_t array.  Doing that will
also avoid the need for 2 calls to MultiByteToWideChar, the first one
to find out how much space to allocate.

 And of course, there's the question of what to do if the filename we are 
 trying to convert to utf-16 is not in fact valid utf-8.

The calls to MultiByteToWideChar should use a flag
(MB_ERR_INVALID_CHARS) in its 2nd argument that makes the function
fail with a distinct error code in that case.  When it fails like
that, the wc_* wrappers should simply call the normal unibyte
functions with the original 'char *' argument.  This makes the
modified code fall back on previous behavior when the source file
names are not in UTF-8.
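
A sketch of one such wrapper with that fallback (wc_open is a
hypothetical name in the style of the patch quoted above):

  #include <windows.h>
  #include <fcntl.h>
  #include <io.h>

  int
  wc_open (const char *filename, int oflag, int pmode)
  {
    wchar_t w_filename[MAX_PATH];

    /* MB_ERR_INVALID_CHARS makes the conversion fail on input that
       is not valid UTF-8, instead of silently mangling it.  */
    if (MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS,
                             filename, -1, w_filename, MAX_PATH))
      return _wopen (w_filename, oflag, pmode);

    /* Not UTF-8 (or too long): fall back to the old unibyte call.  */
    return _open (filename, oflag, pmode);
  }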

And regardless, wget should convert to the locale's codeset (on all
platforms).  Once the above patches are accepted, the Windows build
will pretend that its locale's codeset is UTF-8, and that will ensure
the conversions with MultiByteToWideChar will work in most situations.




Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Wed, 19 Aug 2015 20:50:55 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, keis...@gmail.com,
 bug-wget@gnu.org
 
 On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote:
 
  OK, but how is this different from what we'd get using your suggested
  4 alternatives?
 
 What can I reply? Just read my letter again.
 I think I said what I wanted to say.

OK, then let me explain my line of reasoning.  Plain ASCII is valid
UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
know it's not valid UTF-8.  So the last 3 possibilities in your
suggestion boil down to try converting as if it were UTF-8, and if
that fails, you know it's Unknown.



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 11:58:54AM +0200, Tim Ruehsen wrote:

  Unix filenames are sequences of bytes, they do not have a character set.
 
 The character encoding determines with what symbols these bytes
 (or byte sequences aka multibyte / codepoints) are displayed for you.

Sure. So each time I load a different font, I see different glyphs
for my symbols. The file with single-byte name 0xff will look like
a Dutch ligature ij in some fonts, and quite different in other fonts.

The point is: it is the user's choice to load a font. (Or to set a locale.)
The filenames themselves do not carry additional information
about their character set.
For historical reasons a single directory can have files with names
in several character sets.

All this is about the local situation. One cannot know the character set
of a filename because that concept does not exist in Unix.
About the remote situation even less is known. It would be terrible
if wget decided to use obscure heuristics to invent a remote character set
and then invoke iconv.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 10:29:40AM +0200, Tim Ruehsen wrote:

 I am going with Eli that we should use iconv.
 We know the remote encoding and the local encoding

Do we?

How do you guess the remote encoding?
Is there any particular encoding?
Unix filenames are sequences of bytes, they do not have a character set.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Tim Ruehsen
On Monday 17 August 2015 22:51:12 Andries E. Brouwer wrote:
 On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:
  what do we want to achieve here, and why is what wget did
  before your patch the wrong thing?
 
 Wget modified filenames, and users are unhappy.
 See
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
 http://savannah.gnu.org/bugs/?37564
 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
 http://stackoverflow.com/questions/27054765/wget-japanese-characters
 http://www.win.tue.nl/~aeb/linux/misc/wget.html
 etc.
 
 It is debatable what precisely would be the right thing,
 but my patch greatly increases the number of happy users.
 Further improvement is possible.
 For example, nothing was changed yet for Windows, but also
 Windows users complain about this wget escaping.

I am going with Eli that we should use iconv.
We know the remote encoding and the local encoding, so I don't see a problem 
here. There are a few cases (when using --input-file) where we have to tell 
wget the encoding via --remote-encoding.

On Windows we very often have the default locale Windows-1252 (aka CP1252),
which is a superset of iso-8859-1. While web servers more and more often
deliver content encoded as UTF-8. A UTF-8 filename of 'ö.html' (\xC3\xB6.html)
should be saved as CP1252 ö.html (\xF6.html). If conversion is not possible
due to characters not included in CP1252, we should fall back to escaping
(as an improvement we could first try to convert codepoint by codepoint and
just escape the ones that are not convertible).

This is already done in the 'wget2' branch, where it can be tested (using
src2/wget2). We just have to backport it to Wget's 'master' branch. For me,
this is just a matter of available time.

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:

 what do we want to achieve here, and why is what wget did
 before your patch the wrong thing?

Wget modified filenames, and users are unhappy.
See
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
http://savannah.gnu.org/bugs/?37564
http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
http://stackoverflow.com/questions/27054765/wget-japanese-characters
http://www.win.tue.nl/~aeb/linux/misc/wget.html
etc.

It is debatable what precisely would be the right thing,
but my patch greatly increases the number of happy users.
Further improvement is possible.
For example, nothing was changed yet for Windows, but also
Windows users complain about this wget escaping.

Andries




Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 07:39:40PM +0300, Eli Zaretskii wrote:

  No. An exact copy allows me to decide what I have.
 
 Which is the heuristic by which you want this to be solved.  IMO, such a
 heuristic will not serve most of the users in most use cases.
 Users just want wget to DTRT automatically, and have the file names
 legible.

Let me see whether I understand you correctly.

You want to do the right thing. You think that the right thing
would be to invoke iconv. Since the original character set is
unknown to user and wget, you have to guess. What could one guess?
If the string is ASCII, fine. If the string is valid UTF-8, fine.
If the user has specified the character set, fine.
Otherwise? Leave it as it is?

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote:

   If we convert the file names using iconv, Windows users will also be
   happier, at least when the remote URL can be encoded in their system
   codepage.
  
  Windows does not differ from Unix - since the remote character set
  is unknown and not necessarily constant, a conversion is impossible.
 
 Windows does differ from Unix, in that arbitrary byte sequences cannot
 be used in file names.

Of course. The code already tries to take care of that.

  See
 
   
 https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
 
 for the gory details.

Thanks for the reference!

  I already indicated the 1-line change that fixes the Windows problems.
 
 It doesn't, unfortunately.

You are too brief. What is wrong with the change that changes
/* insert some test for Windows */
into
return true;
?

That change only changes what wget does with bytes in the 128-159 range,
and reading the gory details I fail to see any problem. Almost the opposite:
"Use any character in the current code page for a name, including Unicode
characters and characters in the extended character set (128–255)."
At first sight, if there were a problem it would be because of the clause
"Any other character that the target file system does not allow."

Thanks to your reference I now feel confident to make that 1-line change
so that also Windows users are happy.

Andries


(There are restrictions involving filenames that wget perhaps does not enforce:
no LPT3, no final space or period, ... It might be useful to teach wget about
such details.)
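
A sketch of such a check (hypothetical helper; strncasecmp is POSIX,
spelled _strnicmp on Windows):

  #include <string.h>
  #include <strings.h>

  static int
  windows_reserved_name (const char *base)
  {
    static const char *dev[] = {
      "CON", "PRN", "AUX", "NUL",
      "COM1", "COM2", "COM3", "COM4", "COM5",
      "COM6", "COM7", "COM8", "COM9",
      "LPT1", "LPT2", "LPT3", "LPT4", "LPT5",
      "LPT6", "LPT7", "LPT8", "LPT9"
    };
    size_t i, n = strcspn (base, ".");  /* device names are reserved
                                           even with an extension */
    for (i = 0; i < sizeof dev / sizeof dev[0]; i++)
      if (strlen (dev[i]) == n && strncasecmp (base, dev[i], n) == 0)
        return 1;
    n = strlen (base);
    /* A trailing space or period is also disallowed.  */
    return n > 0 && (base[n - 1] == ' ' || base[n - 1] == '.');
  }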



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 19:51:58 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
 On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote:
 
If we convert the file names using iconv, Windows users will also be
happier, at least when the remote URL can be encoded in their system
codepage.
   
   Windows does not differ from Unix - since the remote character set
   is unknown and not necessarily constant, a conversion is impossible.
  
  Windows does differ from Unix, in that arbitrary byte sequences cannot
  be used in file names.
 
 Of course. The code already tries to take care of that.

It does that badly.

   See
  

  https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
  
  for the gory details.
 
 Thanks for the reference!

You are welcome.

   I already indicated the 1-line change that fixes the Windows problems.
  
  It doesn't, unfortunately.
 
 You are too brief. What is wrong with the change that changes
 /* insert some test for Windows */
 into
 return true;
 ?

It preserves the current behavior, whereby almost every non-ASCII URL
out there gets saved in a file name that is either inaccessible to
localized programs, or shows as illegible mojibake.

 That change only changes what wget does with bytes in the 128-159 range,
 and reading the gory details I fail to see any problem. Almost the opposite:
   "Use any character in the current code page for a name, including
   Unicode characters and characters in the extended character set
   (128–255)."

You need to read between the lines, as it's Microsoft speak.  First,
not every codepoint between 128 and 255 is valid in every codepage.
Second, Windows stores file names in UTF-16, so it attempts to convert
the byte stream into UTF-16 assuming the byte stream is in the current
codepage (which is incorrect in most cases, as we get UTF-8 instead).
The result is an utmost mess.

 Thanks to your reference I now feel confident to make that 1-line change
 so that also Windows users are happy.

Do you still think that?  Then allow me a small demonstration:

  D:\usr\eli\data> wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
  --2015-08-18 21:23:38--  
https://ru.wikipedia.org/wiki/%D7%80%C2%A1%D7%80%C2%B5%D7%81%E2%82%AC%D7%80%C2%B4%D7%81%E2%80%A0%D7%80%C2%B5
  Loaded CA certificate 'd:/usr/etc/ssl/ca-bundle.crt'
  Resolving ru.wikipedia.org (ru.wikipedia.org)... 91.198.174.192
  Connecting to ru.wikipedia.org (ru.wikipedia.org)|91.198.174.192|:443... 
connected.
  HTTP request sent, awaiting response... 404 Not Found
  2015-08-18 21:23:39 ERROR 404: Not Found.

  --2015-08-18 21:23:39--  
https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
  Reusing existing connection to ru.wikipedia.org:443.
  HTTP request sent, awaiting response... 200 OK
  Length: unspecified [text/html]
  Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'

  ╫%80┬í╫%80┬╡╫%81Γ%8 [ =  ] 180.32K   923KB/s   in 0.2s

  2015-08-18 21:23:40 (923 KB/s) - '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' 
saved [184652]

Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
is a good way to express 'Сердце'?  Do you think someone will be able
to read and understand such a file name?  How would you go about
converting it back to what it should be?

 (There are restrictions involving filenames that wget perhaps does not 
 enforce:
 no LPT3, no final space or period, ... It might be useful to teach wget about
 such details.)

Indeed.  But that's a different issue, I think.




Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 09:15:40PM +0300, Eli Zaretskii wrote:

  Otherwise? Leave it as it is?

 No, encode it as %XX hex escapes, thus making the file name pure
 ASCII.  And have an option to leave it as is, so people who want
 that could have that.

OK, I can live with that.


On Tue, Aug 18, 2015 at 09:32:16PM +0300, Eli Zaretskii wrote:

 Second, Windows stores file names in UTF-16, so it attempts to convert
 the byte stream into UTF-16 assuming the byte stream is in the current
 codepage (which is incorrect in most cases, as we get UTF-8 instead).
 The result is an utmost mess.

Yes, conversion always leads to problems.
So, I see that you want to use iconv to convert UTF-8 to the current
codepage, so that Windows can convert that to UTF-16 again.
As stated several times already I have zero experience on Windows,
but is it possible to let wget change its current codepage to Unicode
so that the Windows conversion is close to the identity map?
It seems silly to have a double conversion with data loss
if just a format conversion would suffice.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 21:32:16 +0300
 From: Eli Zaretskii e...@gnu.org
 Cc: bug-wget@gnu.org
 
   --2015-08-18 21:23:39--  
 https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
   Reusing existing connection to ru.wikipedia.org:443.
   HTTP request sent, awaiting response... 200 OK
   Length: unspecified [text/html]
   Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
 
   ╫%80┬í╫%80┬╡╫%81Γ%8 [ =  ] 180.32K   923KB/s   in 0.2s
 
   2015-08-18 21:23:40 (923 KB/s) - 
 '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' saved [184652]
 
 Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
 is a good way to express 'Сердце'?  Do you think someone will be able
 to read and understand such a file name?  How would you go about
 converting it back to what it should be?

And of course the file name that is written is yet a different
mojibake: '׳%80ֲ¡׳%80ֲµ׳%81ג%82¬׳%80ֲ´׳%81ג%80 ׳%80ֲµ' (copied from the
directory listing displayed by UTF-16 capable Emacs).  Note that it
has right-to-left characters in it (probably because my locale is for
the Hebrew language), to make it even less legible due to display-time
reordering per the Unicode UAX#9.




Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 21:11:25 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
 On Tue, Aug 18, 2015 at 09:15:40PM +0300, Eli Zaretskii wrote:
 
   Otherwise? Leave it as it is?
 
  No, encode it as %XX hex escapes, thus making the file name pure
  ASCII.  And have an option to leave it as is, so people who want
  that could have that.
 
 OK, I can live with that.

Great, I'm glad we've found an agreeable compromise.

 So, I see that you want to use iconv to convert UTF-8 to the current
 codepage, so that Windows can convert that to UTF-16 again.

Yes.

 As stated several times already I have zero experience on Windows,
 but is it possible to let wget change its current codepage to Unicode
 so that the Windows conversion is close to the identity map?

No, it's not possible.  Windows does have a UTF-8 codepage, but it
doesn't allow setting that as the system codepage.

What is needed to have a full Unicode support in wget on Windows is to
provide replacements for all the file-name related libc functions
('fopen', 'open', 'stat', 'access', etc.) which will accept file names
encoded in UTF-8, convert them internally into UTF-16, and call the
wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
'_waccess', etc.) with the converted file name.  Another thing that is
needed is similar replacements for 'printf', 'puts', 'fprintf',
etc. when they are used for writing file names to the console --
because we cannot write UTF-8 sequences to the Windows console.  Doing
this is not rocket science (I did something similar for Emacs last
year), but more work than just a call to iconv that's needed on Unix.
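
[A sketch of one such replacement wrapper, with an invented name and
fixed-size buffers as a simplification:]

#include <windows.h>
#include <stdio.h>

/* Open a file whose name is UTF-8 encoded by converting the name
   to UTF-16 and calling the wchar_t variant of fopen.  */
FILE *
utf8_fopen (const char *filename, const char *mode)
{
  wchar_t wname[MAX_PATH], wmode[16];

  if (!MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS,
                            filename, -1, wname, MAX_PATH))
    return NULL;      /* invalid UTF-8, or the name is too long */
  if (!MultiByteToWideChar (CP_UTF8, 0, mode, -1, wmode, 16))
    return NULL;
  return _wfopen (wname, wmode);
}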



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 10:31:31PM +0300, Eli Zaretskii wrote:

  Is it possible to let wget change its current codepage to Unicode
  so that the Windows conversion is close to the identity map?
 
 No, it's not possible.  Windows does have a UTF-8 codepage, but it
 doesn't allow setting that as the system codepage.
 
 What is needed to have a full Unicode support in wget on Windows is to
 provide replacements for all the file-name related libc functions
 ('fopen', 'open', 'stat', 'access', etc.) which will accept file names
 encoded in UTF-8, convert them internally into UTF-16, and call the
 wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
 '_waccess', etc.) with the converted file name.  Another thing that is
 needed is similar replacements for 'printf', 'puts', 'fprintf',
 etc. when they are used for writing file names to the console --
 because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov.
Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html

There we had a similar discussion, and he wrote mswindows.diff with

+int 
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Is something similar what would be needed
on Windows? Is his patch usable? Maybe I also commented a little in
http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
but after that nothing happened, it seems.

Andries




Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Ángel González

On 18/08/15 22:28, Andries E. Brouwer wrote:

On Tue, Aug 18, 2015 at 10:31:31PM +0300, Eli Zaretskii wrote:

No, it's not possible.  Windows does have a UTF-8 codepage, but it
doesn't allow setting that as the system codepage.

What is needed to have a full Unicode support in wget on Windows is to
provide replacements for all the file-name related libc functions
('fopen', 'open', 'stat', 'access', etc.) which will accept file names
encoded in UTF-8, convert them internally into UTF-16, and call the
wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
'_waccess', etc.) with the converted file name.  Another thing that is
needed is similar replacements for 'printf', 'puts', 'fprintf',
etc. when they are used for writing file names to the console --
because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov.
Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html

There we had a similar discussion, and he wrote mswindows.diff with

+int
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Is something similar what would be needed
on Windows? Is his patch usable? Maybe I also commented a little in
http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
but after that nothing happened, it seems.

Andries
That would probably work, but would need a review. On a quick look, some
of the functions have memory leaks (it seems he first used malloc, then
changed only some of the functions to alloca).


And of course, there's the question of what to do if the filename we are 
trying to convert to utf-16 is not in fact valid utf-8.





Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Wed, Aug 19, 2015 at 01:43:51AM +0200, Ángel González wrote:

 And of course, there's the question of what to do if the filename we
 are trying to convert to utf-16 is not in fact valid utf-8.

My current understanding:

(i) there is a current patch, that fixes most problems on Unix
and can be applied today

(ii) one also wants to fix Windows problems, and in the process
do something more general for Unix. We can discuss a future
patch that does something like:

Look at the remote filename.

Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7E), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.

Determine a local character set as follows:
- if the user specified a to-charset, use that
- if the locale uses UTF-8, use that
- otherwise take ASCII

Convert the name from from-charset to to-charset:
- if the user asked for unmodified filenames, do nothing
- if the name is ASCII, do nothing
- if the name is UTF-8 and the locale uses UTF-8, do nothing
- convert from Unknown by hex-escaping the entire name
- convert to ASCII by hex-escaping the entire name
- otherwise invoke iconv(); upon failure, escape the illegal bytes

See whether the resulting name can be used. On Unix all strings
(without NUL and '/') are ok. On Windows there are many restrictions.
Further hex escape problematic characters on Windows.

Since conversions to 8-bit character sets will often fail,
it is desirable to convince Windows to use Unicode as current codeset.
Maybe that requires a copy of the common fileio routines.
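
[In C, the above decision could be sketched as below. Every helper and
option name here is an assumption made up for illustration, not existing
wget code, and the "leave the name unmodified" option is omitted for
brevity:]

#include <string.h>

/* Assumed helpers -- none of these are existing wget functions. */
extern int   is_printable_ascii (const char *s);   /* all bytes 0x20-0x7E  */
extern int   is_valid_utf8 (const char *s);
extern int   locale_is_utf8 (void);
extern char *xstrdup (const char *s);
extern char *hex_escape_all (const char *s);       /* "%XX" the whole name */
extern char *iconv_or_escape (const char *s, const char *from, const char *to);

static char *
local_name_for (const char *remote, const char *from, const char *to)
{
  /* Assign the remote charset when the user gave none. */
  if (from == NULL)
    {
      if (is_printable_ascii (remote))  from = "ASCII";
      else if (is_valid_utf8 (remote))  from = "UTF-8";
      /* else: Unknown; from stays NULL */
    }
  /* Determine the local charset when the user gave none. */
  if (to == NULL)
    to = locale_is_utf8 () ? "UTF-8" : "ASCII";

  if (from != NULL && strcmp (from, "ASCII") == 0)
    return xstrdup (remote);                   /* do nothing */
  if (from != NULL && strcmp (from, "UTF-8") == 0
      && strcmp (to, "UTF-8") == 0)
    return xstrdup (remote);                   /* do nothing */
  if (from == NULL || strcmp (to, "ASCII") == 0)
    return hex_escape_all (remote);            /* escape the entire name */
  return iconv_or_escape (remote, from, to);   /* iconv(); escape bad bytes */
}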

That is my view of the result of the present conversation.
Probably some refinements will be needed. Moreover, there is
interference with iri stuff that should be looked at.

Once we know what we want it is trivial to write the code,
but it may take a while to figure out what we want.
I think we should start applying the current patch.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 12:55:50 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: bug-wget@gnu.org, Andries E. Brouwer andries.brou...@cwi.nl,
 Eli Zaretskii e...@gnu.org
 
 The point is: it is the user's choice to load a font. (Or to set a locale.)

Most users never change a locale, unless they are trying something
special, precisely because their file names will display as mojibake.
So wget should IMO by default cater to this use case, and allow saving
the bytes verbatim as an option.

 For historical reasons a single directory can have files with names
 in several character sets.

Again, this is a rare situation.  We shouldn't punish the majority on
behalf of such rare use cases.

 All this is about the local situation. One cannot know the character set
 of a filename because that concept does not exist in Unix.

Of course, it exists.  The _filesystem_ doesn't know it, but users do.

 About the remote situation even less is known.

Assuming UTF-8 will go a long way towards resolving this.  When this
is not so, we have the --remote-encoding switch.

 It would be terrible if wget decided to use obscure heuristics to
 invent a remote character set and then invoke iconv.

But what you suggest instead -- create a file name whose bytes are an
exact copy of the remote -- is just another heuristic.  And the
effects are no less terrible, because file names will become
illegible, especially on systems where UTF-8 is not the locale's
codeset.

I'm okay with having an option to do that, but it shouldn't be the
default, IMO.



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 17:28:34 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
   About the remote situation even less is known.
  
  Assuming UTF-8 will go a long way towards resolving this.  When this
  is not so, we have the --remote-encoding switch.
 
 This is wget. The user is recursively downloading a file hierarchy.
 Only after downloading does it become clear what one has got.

In some use cases, yes.  In most others, no: the encoding is known in
advance.

 I download a collection of East Asian texts on some topic.
 Upon examination, part is in SJIS, part in Big5, part in EUC-JP,
 part in UTF-8. Since the downloaded stuff does not have a uniform
 character set, and surely the server is not going to specify
 character sets, any invocation of iconv will corrupt my data.
 When I get the unmodified data, I use a browser, an editor, or
 xterm+luit to find out which character set setting gives readable text.

I already said that wget should support this use case.  I just don't
think it should be the default.

   It would be terrible if wget decided to use obscure heuristics to
   invent a remote character set and then invoke iconv.
  
  But what you suggest instead -- create a file name whose bytes are an
  exact copy of the remote -- is just another heuristic.
 
 No. An exact copy allows me to decide what I have.

Which is just the heuristic by which you want this solved.  IMO, such a
heuristic will not serve most of the users in most use cases.
Users just want wget to DTRT automatically, and have the file names
legible.

 Conversion leads to data loss.

When it does, or there's a risk that it does, users should use
optional features to countermand that.



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 17:56:30 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
   For example, nothing was changed yet for Windows, but also
   Windows users complain about this wget escaping.
  
  If we convert the file names using iconv, Windows users will also be
  happier, at least when the remote URL can be encoded in their system
  codepage.
 
 Windows does not differ from Unix - since the remote character set
 is unknown and not necessarily constant, a conversion is impossible.

Windows does differ from Unix, in that arbitrary byte sequences cannot
be used in file names.  See

  
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx

for the gory details.

 I already indicated the 1-line change that fixes the Windows problems.

It doesn't, unfortunately.



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 05:45:13PM +0300, Eli Zaretskii wrote:

  All this is about the local situation. One cannot know the character set
  of a filename because that concept does not exist in Unix.
 
 Of course, it exists.  The _filesystem_ doesn't know it, but users do.

Usually, yes.

  About the remote situation even less is known.
 
 Assuming UTF-8 will go a long way towards resolving this.  When this
 is not so, we have the --remote-encoding switch.

This is wget. The user is recursively downloading a file hierarchy.
Only after downloading does it become clear what one has got.

I download a collection of East Asian texts on some topic.
Upon examination, part is in SJIS, part in Big5, part in EUC-JP,
part in UTF-8. Since the downloaded stuff does not have a uniform
character set, and surely the server is not going to specify
character sets, any invocation of iconv will corrupt my data.
When I get the unmodified data, I use a browser, an editor, or
xterm+luit to find out which character set setting gives readable text.

  It would be terrible if wget decided to use obscure heuristics to
  invent a remote character set and then invoke iconv.
 
 But what you suggest instead -- create a file name whose bytes are an
 exact copy of the remote -- is just another heuristic.

No. An exact copy allows me to decide what I have.
Conversion leads to data loss.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Andries E. Brouwer
On Tue, Aug 18, 2015 at 06:22:41PM +0300, Eli Zaretskii wrote:

  It is debatable what precisely would be the right thing,
  but my patch greatly increases the number of happy users.
 
 AFAIU, it does that only when the target locale is UTF-8.
 By using iconv we can make wget DTRT in more locales.

No, because wget, and the invoker of wget, does not know
the remote character set. And there need not be one.
A Chinese site often has a mixture of material in
Traditional Chinese and Simplified Chinese.
Any conversion would just make the stuff unreadable. 

  For example, nothing was changed yet for Windows, but also
  Windows users complain about this wget escaping.
 
 If we convert the file names using iconv, Windows users will also be
 happier, at least when the remote URL can be encoded in their system
 codepage.

Windows does not differ from Unix - since the remote character set
is unknown and not necessarily constant, a conversion is impossible.
I already indicated the 1-line change that fixes the Windows problems.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-18 Thread Eli Zaretskii
 Date: Mon, 17 Aug 2015 22:51:12 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
 On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:
 
  what do we want to achieve here, and why is what wget did
  before your patch the wrong thing?
 
 Wget modified filenames, and users are unhappy.
 See
 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
 http://savannah.gnu.org/bugs/?37564
 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
 http://stackoverflow.com/questions/27054765/wget-japanese-characters
 http://www.win.tue.nl/~aeb/linux/misc/wget.html
 etc.

There's no argument that wget currently doesn't cope well with these
cases.  The issue being discussed is what should it do instead.

 It is debatable what precisely would be the right thing,
 but my patch greatly increases the number of happy users.

AFAIU, it does that only when the target locale is UTF-8.  By using
iconv we can make wget DTRT in more locales.

 For example, nothing was changed yet for Windows, but also
 Windows users complain about this wget escaping.

If we convert the file names using iconv, Windows users will also be
happier, at least when the remote URL can be encoded in their system
codepage.  (To support characters outside of the system codepage,
deeper changes are needed in the Windows build of wget, for the
reasons I explained elsewhere in this thread.)



Re: [Bug-wget] bad filenames (again)

2015-08-17 Thread Andries E. Brouwer
On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote:

(i) [about using setlocale]

   First, relying on UTF-8 locale to be announced in the environment
   is less portable than it could be: it's better to call 'setlocale'
   Then ... at least Cygwin will not be excluded from this feature.
  
  I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
  because I do not know anything about these platforms.
 
 These systems don't normally have the LC_* environment
 variables, and their 'setlocale' (with the exception of Cygwin) does
 not look at those variables.  But you _can_ obtain the current locale
 on all supported systems by calling 'setlocale'.

Good. Then perhaps using setlocale would be better.

I will not do so - do not feel confident on the Windows platform.
After all, the goal is not to find out what locale we are in,
but to find out whether it might be a good idea to escape certain
bytes in a filename. The original author's code was based on the
idea that the system is using an ISO-8859-n character set.
On Windows I guess that FAT filesystems will use some code page,
and NTFS filesystems will use Unicode.
If that is correct, then perhaps it never makes sense
to do this escape of high control bytes on a Windows system.

[So, I conjecture that we could make Windows users happy
by replacing
  /* insert some test for Windows */
by
  return true;
(and updating the functionname).]



(ii) [about possibly using iconv]

 How do you guess the original character set?

Since you pass silently over this point, it seems
there is no good way to involve iconv.


 This is a philosophical question: is a Cyrillic file name encoded in
 koi8-r and the same name encoded in UTF-8 modified data, or the
 same data expressed in different codesets?

Unix filenames are not necessarily in any particular character set.
They are sequences of bytes different from NUL and '/'.
A different sequence of bytes is a different filename.

Also, "the same name encoded in UTF-8" is an optimistic description.
Should the Unicode be NFC? Or NFD? MacOS has a third version.
Even if the filename had a well-defined and known character set,
conversion to UTF-8 is not uniquely defined.
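
[For example (an illustration, using the name "ä"):
   NFC:  0xC3 0xA4        = U+00E4
   NFD:  0x61 0xCC 0x88   = U+0061 U+0308 (a + combining diaeresis)
Both are valid UTF-8 encodings of "the same name", yet they are
different byte sequences, hence different Unix filenames.]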

So, it seems to me that one cannot use iconv unless
--remote-encoding and --local-encoding have been specified
by the user. And if that is the case, then perhaps iconv
is already invoked (in the iri code; I have not checked the details).


Andries



Re: [Bug-wget] bad filenames (again)

2015-08-17 Thread Tim Ruehsen
On Thursday 13 August 2015 19:10:41 Andries E. Brouwer wrote:
 On Thu, Aug 13, 2015 at 05:54:57PM +0200, Tim Ruehsen wrote:
  I just made up a test case, but can't apply your patch.
 
  Please rebase to latest git master and generate your patch with
  git format-patch and send it as attachment. Thanks.

 OK, see attached.

 Andries

Based on that, and your proposal about the progress bar, I made up a bunch of
patches. The new test case is not yet ready.
@Andries: Maybe you can put a few more test cases into that (or send me a few
examples that should work). I also would like to see broken UTF-8 sequences in
this test.

@Darshit Could you have a closer look at the patches, please? Neither
python nor the progress code is my playground... you are the expert here.

Tim
From 1ae1aeda78d83e570fe7ee5881c7e9caf182e991 Mon Sep 17 00:00:00 2001
From: Andries E. Brouwer a...@cwi.nl
Date: Thu, 13 Aug 2015 19:06:03 +0200
Subject: [PATCH 1/4] Do not escape high control bytes on a UTF-8 system.

---
 src/init.c| 26 +-
 src/options.h |  1 +
 src/url.c | 12 +---
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/src/init.c b/src/init.c
index ea074cc..6f71de1 100644
--- a/src/init.c
+++ b/src/init.c
@@ -348,6 +348,27 @@ command_by_name (const char *cmdname)
   return -1;
 }

+
+/* Used to determine whether bytes 128-159 are OK in a filename */
+static int
+have_utf8_locale() {
+#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
+  /* insert some test for Windows */
+#else
+  char *p;
+
+  p = getenv ("LC_ALL");
+  if (p == NULL)
+    p = getenv ("LC_CTYPE");
+  if (p == NULL)
+    p = getenv ("LANG");
+  if (strstr (p, "UTF-8") != NULL || strstr (p, "UTF8") != NULL ||
+      strstr (p, "utf-8") != NULL || strstr (p, "utf8") != NULL)
+    return true;
+#endif
+  return false;
+}
+
 /* Reset the variables to default values.  */
 void
 defaults (void)
@@ -419,6 +440,7 @@ defaults (void)
   opt.restrict_files_os = restrict_unix;
 #endif
   opt.restrict_files_ctrl = true;
+  opt.restrict_files_highctrl = (have_utf8_locale() ? false : true);
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;

@@ -1487,6 +1509,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
 {
   int restrict_os = opt.restrict_files_os;
   int restrict_ctrl = opt.restrict_files_ctrl;
+  int restrict_highctrl = opt.restrict_files_highctrl;
   int restrict_case = opt.restrict_files_case;
   int restrict_nonascii = opt.restrict_files_nonascii;

@@ -1511,7 +1534,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
   else if (VAL_IS (uppercase))
 restrict_case = restrict_uppercase;
   else if (VAL_IS (nocontrol))
-restrict_ctrl = false;
+restrict_ctrl = restrict_highctrl = false;
   else if (VAL_IS (ascii))
 restrict_nonascii = true;
   else
@@ -1532,6 +1555,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno

   opt.restrict_files_os = restrict_os;
   opt.restrict_files_ctrl = restrict_ctrl;
+  opt.restrict_files_highctrl = restrict_highctrl;
   opt.restrict_files_case = restrict_case;
   opt.restrict_files_nonascii = restrict_nonascii;

diff --git a/src/options.h b/src/options.h
index 24ddbb5..083d16b 100644
--- a/src/options.h
+++ b/src/options.h
@@ -251,6 +251,7 @@ struct options
   bool restrict_files_ctrl; /* non-zero if control chars in URLs
are restricted from appearing in
generated file names. */
+  bool restrict_files_highctrl; /* idem for bytes 128-159 */
   bool restrict_files_nonascii; /* non-zero if bytes with values greater
than 127 are restricted. */
   enum {
diff --git a/src/url.c b/src/url.c
index 73c8dd0..e98bfaa 100644
--- a/src/url.c
+++ b/src/url.c
@@ -1348,7 +1348,8 @@ enum {
   filechr_not_unix= 1,  /* unusable on Unix, / and \0 */
   filechr_not_vms = 2,  /* unusable on VMS (ODS5), 0x00-0x1F * ? */
   filechr_not_windows = 4,  /* unusable on Windows, one of \|/?:* */
-  filechr_control = 8   /* a control character, e.g. 0-31 */
+  filechr_control = 8,  /* a control character, e.g. 0-31 */
+  filechr_highcontrol = 16  /* a high control character, in 128-159 */
 };

 #define FILE_CHAR_TEST(c, mask) \
@@ -1360,6 +1361,7 @@ enum {
 #define V filechr_not_vms
 #define W filechr_not_windows
 #define C filechr_control
+#define Z filechr_highcontrol

 #define UVWC U|V|W|C
 #define UW U|W
@@ -1392,8 +1394,8 @@ UVWC, VC, VC, VC,  VC, VC, VC, VC,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   st   u   v   w   */
0,  0,  0,  0,   W,  0,  0,  C,   /* x   y   z   {|   }   ~   DEL */

-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 128-143 */
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 144-159 */

Re: [Bug-wget] bad filenames (again)

2015-08-17 Thread Andries E. Brouwer
On Mon, Aug 17, 2015 at 01:17:06PM +0200, Tim Ruehsen wrote:

 @Andries: Maybe you can put a few more test cases into that
 (or send me a few examples that should work).
 I also would like to see broken UTF-8 sequences in this test.

By some coincidence Noël Köthe just sent a bug report
that provides one more test case.

Fetch http://zh.wikipedia.org/wiki/%E9%A6%96%E9%A1%B5.

One hopes to get a file with file name 首页, that is,
with bytes e9 a6 96 e9 a1 b5, and that is what the patched wget gives.
The unpatched wget makes it (unpronounceable) with
bytes e9 a6 25 39 36 e9 a1 b5 (because the byte 96 was escaped into %96).

Andries



[Here it is clear what one wants. In examples with broken UTF-8
sequences, something will happen as a result of the present code.
It is unclear whether we want that or not. Changing the filename
is bad, but illegal utf-8 is also bad. Today I prefer the unchanged
filename, but see no need for a test that checks that we really get that.]



Re: [Bug-wget] bad filenames (again)

2015-08-17 Thread Andries E. Brouwer
On Mon, Aug 17, 2015 at 06:27:05PM +0300, Eli Zaretskii wrote:

 (ii) [about possibly using iconv]
 
 How do you guess the original character set?

 The answer is call nl_langinfo (CODESET).

I think we are not communicating.

wget fetches a file from a remote machine.
We know the filename (as a sequence of bytes).
As far as I can see, there is no information on what character set
(if any) that sequence of bytes might be in.

In order to call iconv, I need a from-charset and a to-charset.
I think your answer tells me how to find a reasonable to-charset.
But the problem is how to find a from-charset.

[Even when from-charset and to-charset are known there is
a can of worms involved in conversion. But without from-charset
one cannot even start thinking about conversion.]

  Unix filenames are not necessarily in any particular character set.
  They are sequences of bytes different from NUL and '/'.
  A different sequence of bytes is a different filename.
 
 As long as you treat them as UTF-8 encoded strings, ...

I don't understand how one can treat sequences of bytes
that are not valid UTF-8 as UTF-8 encoded strings.
If all the world is UTF-8 then fine. But the remote machine
is an unknown system. We just have a byte sequence, that is all.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-17 Thread Eli Zaretskii
 Date: Mon, 17 Aug 2015 12:59:05 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
 On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote:
 
 (i) [about using setlocale]
 
First, relying on UTF-8 locale to be announced in the environment
is less portable than it could be: it's better to call 'setlocale'
Then ... at least Cygwin will not be excluded from this feature.
   
   I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
   because I do not know anything about these platforms.
  
  These systems don't normally have the LC_* environment
  variables, and their 'setlocale' (with the exception of Cygwin) does
  not look at those variables.  But you _can_ obtain the current locale
  on all supported systems by calling 'setlocale'.
 
 Good. Then perhaps using setlocale would be better.
 
 I will not do so - do not feel confident on the Windows platform.

You don't need to -- do it on your OS, and the same will work
elsewhere.

 After all, the goal is not to find out what locale we are in,
 but to find out whether it might be a good idea to escape certain
 bytes in a filename.

Indeed, you want the current locale's codeset, see below.

 On Windows I guess that FAT filesystems will use some code page,
 and NTFS filesystems will use Unicode.

Not exactly.  The functions that emulate Posix and accept file names
as char * strings cannot use Unicode on Windows, because using
Unicode means using wchar_t strings instead.  So, unless Someone™
changes wget to do that, at least on Windows, the Windows port will
still use the current system codepage, even on NTFS, because that's
what functions like 'fopen', 'open', 'stat', etc. assume.

 (ii) [about possibly using iconv]
 
  How do you guess the original character set?
 
 Since you pass silently over this point

No, I just missed that, sorry.

The answer is call nl_langinfo (CODESET).  Windows doesn't have
'nl_langinfo', but it is easily emulated with more or less a single
API call, or we could use the Gnulib replacement (which already does
support Windows).

 it seems there is no good way to involve iconv.

Actually, there's no problem, see above.  Many programs do it like
that already.
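
[A sketch of that usual approach, with simplified error handling; the
function name is invented, and the from-charset still has to come from
somewhere:]

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Convert NAME from FROM_CHARSET into the locale's codeset.  Returns
   the output length, or (size_t) -1 on failure (e.g. EILSEQ), in
   which case the caller should fall back to escaping.  */
static size_t
to_local_codeset (const char *from_charset, const char *name,
                  char *out, size_t outsize)
{
  char *in = (char *) name;
  char *p = out;
  size_t inleft = strlen (name), outleft = outsize - 1;
  iconv_t cd;

  setlocale (LC_CTYPE, "");   /* make nl_langinfo reflect the environment */
  cd = iconv_open (nl_langinfo (CODESET), from_charset);
  if (cd == (iconv_t) -1)
    return (size_t) -1;       /* unknown charset pair */
  if (iconv (cd, &in, &inleft, &p, &outleft) == (size_t) -1)
    {
      iconv_close (cd);
      return (size_t) -1;     /* e.g. an illegal input sequence */
    }
  *p = '\0';
  iconv_close (cd);
  return (size_t) (p - out);
}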

  This is a philosophical question: is a Cyrillic file name encoded in
  koi8-r and the same name encoded in UTF-8 modified data, or the
  same data expressed in different codesets?
 
 Unix filenames are not necessarily in any particular character set.
 They are sequences of bytes different from NUL and '/'.
 A different sequence of bytes is a different filename.

As long as you treat them as UTF-8 encoded strings, they are, for all
practical purposes, in the Unicode character set.  (Which, btw, brings
up the question what to do if the UTF-8 sequence is for u+FFFD or is
simply invalid -- do we treat them as control characters or don't we?)

 Also, "the same name encoded in UTF-8" is an optimistic description.
 Should the Unicode be NFC? Or NFD? MacOS has a third version.

It doesn't matter, since any filesystem worth its sectors will DTRT
and any ls-like program will, too, and will show you a perfectly
legible file name.

 Even if the filename had a well-defined and known character set,
 conversion to UTF-8 is not uniquely defined.

Do whatever iconv does, and we will be fine.




Re: [Bug-wget] bad filenames (again)

2015-08-17 Thread Eli Zaretskii
 Date: Mon, 17 Aug 2015 19:58:31 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
 On Mon, Aug 17, 2015 at 06:27:05PM +0300, Eli Zaretskii wrote:
 
  (ii) [about possibly using iconv]
  
  How do you guess the original character set?
 
  The answer is call nl_langinfo (CODESET).
 
 I think we are not communicating.
 
 wget fetches a file from a remote machine.
 We know the filename (as a sequence of bytes).
 As far as I can see, there is no information on what character set
 (if any) that sequence of bytes might be in.

Then please explain why you started this thread by saying that the
byte sequence should end up unaltered in the filesystem (and wrote the
patch to do the same, AFAIU) if the target's locale uses UTF-8 as its
encoding.  What do you expect the file names to look like in 'ls' or
anything similar, after doing that?

 In order to call iconv, I need a from-charset and a to-charset.
 I think your answer tells me how to find a reasonable to-charset.
 But the problem is how to find a from-charset.

I thought the from-charset was UTF-8, or at least you assumed that.
If it isn't, I see even less sense in the idea of your patch, which is
basically writing the bytes unaltered.  Don't we want to try to have
on the target the same file names as on the source?  If not, what do
we want to achieve here, and why is what wget did before your patch
the wrong thing?

 [Even when from-charset and to-charset are known there is
 a can of worms involved in conversion.

No can of worms that I could see.  Either the conversion succeeds, or
it fails.  You get a clear indication from iconv about that.

   Unix filenames are not necessarily in any particular character set.
   They are sequences of bytes different from NUL and '/'.
   A different sequence of bytes is a different filename.
  
  As long as you treat them as UTF-8 encoded strings, ...
 
 I don't understand how one can treat sequences of bytes
 that are not valid UTF-8 as UTF-8 encoded strings.
 If all the world is UTF-8 then fine. But the remote machine
 is an unknown system. We just have a byte sequence, that is all.

If we know nothing about the source encoding, then the only sane thing
is to always hex-encode characters with 8th bit set.  But that's not
what your patch does.  It writes the byte stream verbatim to the
filesystem if the target locale uses UTF-8 as its codeset.  Please
explain the logic behind this, because I don't see it.
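
[A sketch of that unconditional escaping; the function name is invented:]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* %XX-escape every byte with the 8th bit set; plain ASCII bytes
   pass through unchanged.  */
static char *
hex_escape_high (const char *name)
{
  const unsigned char *s = (const unsigned char *) name;
  char *out = malloc (3 * strlen (name) + 1);   /* worst case: all "%XX" */
  char *p = out;

  if (out == NULL)
    return NULL;
  for (; *s != '\0'; s++)
    {
      if (*s >= 0x80)
        p += sprintf (p, "%%%02X", (unsigned) *s);
      else
        *p++ = (char) *s;
    }
  *p = '\0';
  return out;
}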



Re: [Bug-wget] bad filenames (again)

2015-08-16 Thread Eli Zaretskii
 Date: Thu, 13 Aug 2015 19:10:41 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: bug-wget@gnu.org, Andries E. Brouwer andries.brou...@cwi.nl
 
 +/* Used to determine whether bytes 128-159 are OK in a filename */
 +static int
 +have_utf8_locale() {
 +#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
 +  /* insert some test for Windows */
 +#else
 +  char *p;
 +
 +  p = getenv ("LC_ALL");
 +  if (p == NULL)
 +    p = getenv ("LC_CTYPE");
 +  if (p == NULL)
 +    p = getenv ("LANG");
 +  if (strstr (p, "UTF-8") != NULL || strstr (p, "UTF8") != NULL ||
 +      strstr (p, "utf-8") != NULL || strstr (p, "utf8") != NULL)
 +    return true;
 +#endif
 +  return false;
 +}
 [...]
 +  opt.restrict_files_highctrl = (have_utf8_locale() ? false : true);

I'm not sure this is the right way to fix this.  First, relying on
UTF-8 locale to be announced in the environment is less portable than
it could be: it's better to call 'setlocale' with the 2nd argument
NULL to glean the same information.  Then the ugly #ifdef above could
be dropped, and at least Cygwin will not be excluded from this
feature.
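
[The setlocale-based test might look like the sketch below; it assumes
the program has called setlocale (LC_CTYPE, "") at startup, so that the
query reflects the environment:]

#include <locale.h>
#include <string.h>

static int
have_utf8_locale (void)
{
  /* setlocale with a NULL second argument only queries the current
     LC_CTYPE setting; it changes nothing.  */
  const char *loc = setlocale (LC_CTYPE, NULL);

  return loc != NULL
         && (strstr (loc, "UTF-8") != NULL || strstr (loc, "utf-8") != NULL
             || strstr (loc, "UTF8") != NULL || strstr (loc, "utf8") != NULL);
}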

Moreover, even if the locale is not UTF-8, wget should attempt to
convert the file names to the current locale using iconv (which I
believe was what Tim suggested).  This will DTRT in much more cases
than the above UTF-8 centric approach, IMO.

Thanks.



Re: [Bug-wget] bad filenames (again)

2015-08-16 Thread Eli Zaretskii
 Date: Sun, 16 Aug 2015 22:21:20 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
 On Sun, Aug 16, 2015 at 05:43:50PM +0300, Eli Zaretskii wrote:
 
 (i)
 
  #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
/* insert some test for Windows */
  #else
   ... code that uses getenv to test LC_ALL, LC_CTYPE, LANG ...
  #endif
 
  I'm not sure this is the right way to fix this.  First, relying on
  UTF-8 locale to be announced in the environment is less portable than
  it could be: it's better to call 'setlocale' with the 2nd argument
  NULL to glean the same information.  Then the ugly #ifdef above could
  be dropped, and at least Cygwin will not be excluded from this
  feature.
 
 I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
 because I do not know anything about these platforms. It is quite
 possible that the #ifdef is unneeded.
 
 Are you saying that it in fact is needed when getenv() is used,
 but unneeded when setlocale() is used?

Yes.  These systems don't normally have the LC_* environment
variables, and their 'setlocale' (with the exception of Cygwin) does
not look at those variables.  But you _can_ obtain the current locale
on all supported systems by calling 'setlocale'.

 And then what about LANG?

What about it?  You can test it in the environment, if you want, but
IMO it's unnecessary, since either 'setlocale' already does, or the
variable is not relevant to the issue at hand.  (You need the codeset,
not the language.)

  Moreover, even if the locale is not UTF-8, wget should attempt to
  convert the file names to the current locale using iconv (which I
  believe was what Tim suggested).  This will DTRT in much more cases
  than the above UTF-8 centric approach, IMO.
 
 Hmm. My own point of view is almost the opposite. In my life I have
 spent countless hours trying to repair the damage done by software
 that helpfully modified my data.
 I prefer my data as-is, unless I explicitly ask for conversion.

This is a philosophical question: is a Cyrillic file name encoded in
koi8-r and the same name encoded in UTF-8 modified data, or the
same data expressed in different codesets?

Converting encoding as required by the locale is the expected
behavior.  Windows, for example, does that automatically (if
possible).

 The patch enlarges the number of cases where the original data
 is preserved. Yes, I am all in favour of enlarging that number of
 cases even further. This is only a first step. But in my eyes
 applying iconv would be a step back. It can be really tricky to
 decode the mojibake obtained by converting A to C, while
 the original really was in B.

If iconv succeeds in converting, you won't see any mojibake to begin
with.  If it fails, then yes, the conversion should be abandoned.

 What should happen when iconv() returns EILSEQ?

Turn on the restrict_files_highctrl option, like you do now.



Re: [Bug-wget] bad filenames (again)

2015-08-16 Thread Andries E. Brouwer
On Sun, Aug 16, 2015 at 05:43:50PM +0300, Eli Zaretskii wrote:

(i)

 #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
   /* insert some test for Windows */
 #else
  ... code that uses getenv to test LC_ALL, LC_CTYPE, LANG ...
 #endif

 I'm not sure this is the right way to fix this.  First, relying on
 UTF-8 locale to be announced in the environment is less portable than
 it could be: it's better to call 'setlocale' with the 2nd argument
 NULL to glean the same information.  Then the ugly #ifdef above could
 be dropped, and at least Cygwin will not be excluded from this
 feature.

I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
because I do not know anything about these platforms. It is quite
possible that the #ifdef is unneeded.

Are you saying that it in fact is needed when getenv() is used,
but unneeded when setlocale() is used? And then what about LANG?


(ii)

 Moreover, even if the locale is not UTF-8, wget should attempt to
 convert the file names to the current locale using iconv (which I
 believe was what Tim suggested).  This will DTRT in much more cases
 than the above UTF-8 centric approach, IMO.

Hmm. My own point of view is almost the opposite. In my life I have
spent countless hours trying to repair the damage done by software
that helpfully modified my data.
I prefer my data as-is, unless I explicitly ask for conversion.

I think Tim suggested something else (namely, just checking whether
the filename was valid UTF-8), but never mind.
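
[Such a check is a one-liner with gnulib's unistr module; a sketch:]

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <unistr.h>   /* gnulib: unistr/u8-check */

static bool
name_is_valid_utf8 (const char *name)
{
  /* u8_check returns NULL when the byte sequence is well-formed
     UTF-8, else a pointer to the first offending unit.  */
  return u8_check ((const uint8_t *) name, strlen (name)) == NULL;
}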

The patch enlarges the number of cases where the original data
is preserved. Yes, I am all in favour of enlarging that number of
cases even further. This is only a first step. But in my eyes
applying iconv would be a step back. It can be really tricky to
decode the mojibake obtained by converting A to C, while
the original really was in B.
How do you guess the original character set?
What should happen when iconv() returns EILSEQ?


Andries





Re: [Bug-wget] bad filenames (again)

2015-08-15 Thread Darshit Shah
I guess this issue is now closed? We should document libgpgme11-dev as
a dependency.

On Fri, Aug 14, 2015 at 1:38 AM, Tim Rühsen tim.rueh...@gmx.de wrote:
 Am Donnerstag, 13. August 2015, 19:33:56 schrieb Andries E. Brouwer:
 After git clone, one gets a wget tree without autogenerated files.
 README.checkout tells one to run ./bootstrap to generate configure.

 But:

 $ ./bootstrap
 ./bootstrap: Bootstrapping from checked-out wget sources...
 ./bootstrap: consider installing git-merge-changelog from gnulib
 ./bootstrap: getting gnulib files...
 ...

 running: AUTOPOINT=true LIBTOOLIZE=true autoreconf --verbose --install
 --force -I m4  --no-recursive autoreconf: Entering directory `.'
 autoreconf: running: true --force
 autoreconf: running: aclocal -I m4 --force -I m4
 configure.ac:498: warning: macro 'AM_PATH_GPGME' not found in library
 autoreconf: configure.ac: tracing
 autoreconf: configure.ac: not using Libtool
 autoreconf: running: /usr/bin/autoconf --include=m4 --force
 configure.ac:93: error: possibly undefined macro: AC_DEFINE
   If this token and others are legitimate, please use m4_pattern_allow.
   See the Autoconf documentation.
 configure.ac:498: error: possibly undefined macro: AM_PATH_GPGME
 autoreconf: /usr/bin/autoconf failed with exit status: 1
 ./bootstrap: autoreconf failed

 Yes sorry, that is a recent issue with metalink. Darshit works on that.

 You have to install libgpgme11-dev (Or similar name).

 Tim




-- 
Thanking You,
Darshit Shah
From b495c71adc88642d06f141c612f82ba10bdb7ee1 Mon Sep 17 00:00:00 2001
From: Darshit Shah dar...@gmail.com
Date: Sat, 15 Aug 2015 12:22:33 +0530
Subject: [PATCH] Document dependency on libgpgme11-dev

* README.checkout: Document dependency on libgpgme11-dev required by
the metalink code.
---
 README.checkout | 5 +
 1 file changed, 5 insertions(+)

diff --git a/README.checkout b/README.checkout
index 03463d1..eff6abc 100644
--- a/README.checkout
+++ b/README.checkout
@@ -94,6 +94,10 @@ Compiling From Repository Sources
saved the .pc file. Example:
$ PKG_CONFIG_PATH=. ./configure
 
+* [46]libgpgme11-dev is required to compile with support for metalink files
+  and GPGME support. Metalink requires this library to verify the integrity
+  of the download.
+
 
For those who might be confused as to what to do once they check out
the source code, considering configure and Makefile do not yet exist at
@@ -200,3 +204,4 @@ References
   43. http://validator.w3.org/check?uri=referer
   44. http://wget.addictivecode.org/WikiLicense
   45. https://www.python.org/
+  46. https://www.gnupg.org/%28it%29/related_software/gpgme/index.html
-- 
2.5.0



Re: [Bug-wget] bad filenames (again)

2015-08-13 Thread Andries E. Brouwer
On Thu, Aug 13, 2015 at 05:54:57PM +0200, Tim Ruehsen wrote:

 I just made up a test case, but can't apply your patch.
 
 Please rebase to latest git master and generate your patch with
 git format-patch and send it as attachment. Thanks.

OK, see attached.

Andries
From 5980a3665d8924c7d2374f0740bb82ff0cdc9043 Mon Sep 17 00:00:00 2001
From: Andries E. Brouwer a...@cwi.nl
Date: Thu, 13 Aug 2015 19:06:03 +0200
Subject: [PATCH] Do not escape high control bytes on a UTF-8 system.

---
 src/init.c| 26 +-
 src/options.h |  1 +
 src/url.c | 12 +---
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/src/init.c b/src/init.c
index ea074cc..6f71de1 100644
--- a/src/init.c
+++ b/src/init.c
@@ -348,6 +348,27 @@ command_by_name (const char *cmdname)
   return -1;
 }
 
+
+/* Used to determine whether bytes 128-159 are OK in a filename */
+static int
+have_utf8_locale() {
+#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
+  /* insert some test for Windows */
+#else
+  char *p;
+
+  p = getenv ("LC_ALL");
+  if (p == NULL)
+    p = getenv ("LC_CTYPE");
+  if (p == NULL)
+    p = getenv ("LANG");
+  if (strstr (p, "UTF-8") != NULL || strstr (p, "UTF8") != NULL ||
+      strstr (p, "utf-8") != NULL || strstr (p, "utf8") != NULL)
+    return true;
+#endif
+  return false;
+}
+
 /* Reset the variables to default values.  */
 void
 defaults (void)
@@ -419,6 +440,7 @@ defaults (void)
   opt.restrict_files_os = restrict_unix;
 #endif
   opt.restrict_files_ctrl = true;
+  opt.restrict_files_highctrl = (have_utf8_locale() ? false : true);
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;
 
@@ -1487,6 +1509,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
 {
   int restrict_os = opt.restrict_files_os;
   int restrict_ctrl = opt.restrict_files_ctrl;
+  int restrict_highctrl = opt.restrict_files_highctrl;
   int restrict_case = opt.restrict_files_case;
   int restrict_nonascii = opt.restrict_files_nonascii;
 
@@ -1511,7 +1534,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
   else if (VAL_IS (uppercase))
 restrict_case = restrict_uppercase;
   else if (VAL_IS (nocontrol))
-restrict_ctrl = false;
+restrict_ctrl = restrict_highctrl = false;
   else if (VAL_IS (ascii))
 restrict_nonascii = true;
   else
@@ -1532,6 +1555,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
 
   opt.restrict_files_os = restrict_os;
   opt.restrict_files_ctrl = restrict_ctrl;
+  opt.restrict_files_highctrl = restrict_highctrl;
   opt.restrict_files_case = restrict_case;
   opt.restrict_files_nonascii = restrict_nonascii;
 
diff --git a/src/options.h b/src/options.h
index 24ddbb5..083d16b 100644
--- a/src/options.h
+++ b/src/options.h
@@ -251,6 +251,7 @@ struct options
   bool restrict_files_ctrl; /* non-zero if control chars in URLs
are restricted from appearing in
generated file names. */
+  bool restrict_files_highctrl; /* idem for bytes 128-159 */
   bool restrict_files_nonascii; /* non-zero if bytes with values greater
than 127 are restricted. */
   enum {
diff --git a/src/url.c b/src/url.c
index 73c8dd0..e98bfaa 100644
--- a/src/url.c
+++ b/src/url.c
@@ -1348,7 +1348,8 @@ enum {
   filechr_not_unix= 1,  /* unusable on Unix, / and \0 */
   filechr_not_vms = 2,  /* unusable on VMS (ODS5), 0x00-0x1F * ? */
   filechr_not_windows = 4,  /* unusable on Windows, one of \|/?:* */
-  filechr_control = 8   /* a control character, e.g. 0-31 */
+  filechr_control = 8,  /* a control character, e.g. 0-31 */
+  filechr_highcontrol = 16  /* a high control character, in 128-159 */
 };
 
 #define FILE_CHAR_TEST(c, mask) \
@@ -1360,6 +1361,7 @@ enum {
 #define V filechr_not_vms
 #define W filechr_not_windows
 #define C filechr_control
+#define Z filechr_highcontrol
 
 #define UVWC U|V|W|C
 #define UW U|W
@@ -1392,8 +1394,8 @@ UVWC, VC, VC, VC,  VC, VC, VC, VC,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   st   u   v   w   */
0,  0,  0,  0,   W,  0,  0,  C,   /* x   y   z   {|   }   ~   DEL */
 
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 128-143 */
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 144-159 */
+  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z, /* 128-143 */
+  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z, /* 144-159 */
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
 
@@ -1406,6 +1408,7 @@ UVWC, VC, VC, VC,  VC, VC, VC, VC,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
 #undef V
 #undef W
 #undef C
+#undef Z
 #undef UW
 #undef UVWC
 #undef VC
@@ -1448,8 +1451,11 @@ append_uri_pathel (const char *b, const char *e, bool 

Re: [Bug-wget] bad filenames (again)

2015-08-13 Thread Tim Rühsen
Am Donnerstag, 13. August 2015, 19:33:56 schrieb Andries E. Brouwer:
 After git clone, one gets a wget tree without autogenerated files.
 README.checkout tells one to run ./bootstrap to generate configure.
 
 But:
 
 $ ./bootstrap
 ./bootstrap: Bootstrapping from checked-out wget sources...
 ./bootstrap: consider installing git-merge-changelog from gnulib
 ./bootstrap: getting gnulib files...
 ...
 
 running: AUTOPOINT=true LIBTOOLIZE=true autoreconf --verbose --install
 --force -I m4  --no-recursive autoreconf: Entering directory `.'
 autoreconf: running: true --force
 autoreconf: running: aclocal -I m4 --force -I m4
 configure.ac:498: warning: macro 'AM_PATH_GPGME' not found in library
 autoreconf: configure.ac: tracing
 autoreconf: configure.ac: not using Libtool
 autoreconf: running: /usr/bin/autoconf --include=m4 --force
 configure.ac:93: error: possibly undefined macro: AC_DEFINE
   If this token and others are legitimate, please use m4_pattern_allow.
   See the Autoconf documentation.
 configure.ac:498: error: possibly undefined macro: AM_PATH_GPGME
 autoreconf: /usr/bin/autoconf failed with exit status: 1
 ./bootstrap: autoreconf failed

Yes sorry, that is a recent issue with metalink. Darshit works on that.

You have to install libgpgme11-dev (Or similar name).

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-13 Thread Tim Ruehsen
Hi Andries,

I just made up a test case, but can't apply your patch.

Please rebase to latest git master and generate your patch with
git format-patch and send it as attachment. Thanks.

Regards, Tim

On Wednesday 12 August 2015 19:36:52 Andries E. Brouwer wrote:
 On Wed, Aug 12, 2015 at 05:54:25PM +0200, Tim Ruehsen wrote:
  OK. Let's set up a test where we define input and expected output.
  If that works, I am fine.
 
 OK. I mentioned a Hebrew example, but in order to avoid
 the additional difficulty of bidi text, let me find a
 Russian example instead.
 
 % wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
 Saving to: ‘Се\321%80д\321%86е’
 
 % my_wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
 Saving to: ‘Сердце’
 
 (This is the Russian Wikipedia page for 'heart').
 
 Andries
 
 
 ---
 
 BTW - now that I tried this: the progress bar contains an ugly symbol.
 Looking at progress.c I see
 
   int padding = MAX_FILENAME_COLS - orig_filename_cols;
   sprintf (p, "%s ", bp->f_download);
   p += orig_filename_cols + 1;
   for (; padding; padding--)
     *p++ = ' ';
 
 but orig_filename_cols was computed correctly, counting character
 positions, not bytes, and the
   p += orig_filename_cols + 1;
 is a bug.
 The ugly symbol is because a multibyte character was truncated.
 
 If I write
 
   sprintf (p, "%s ", bp->f_download);
   p += strlen (bp->f_download) + 1;
   while (p < bp->buffer + MAX_FILENAME_COLS)
     *p++ = ' ';
 
 instead, then the progress bar text looks right in this particular case.
 I have not yet read the surrounding code.




Re: [Bug-wget] bad filenames (again)

2015-08-12 Thread Tim Ruehsen
On Wednesday 12 August 2015 14:38:15 Andries E. Brouwer wrote:
 Hi Tim,
 
  Just a few questions.
  
  1.
  Why don't you use 'opt.locale' to check if the local encoding is UTF-8 ?
 
 I thought that was usable only if ENABLE_IRI was defined.

I see. ENABLE_IRI, libiconv (for conversion) and libidn (used for setting 
opt.locale) are tightly coupled. Understandable that you won't go into that 
swamp.

  2.
  I don't understand how you distinguish between illegal and legal UTF-8
  sequences. I guess only legal sequences should be unescaped.
  Or to make it easy: if the string is valid UTF-8, do not escape.
  If it is not valid UTF-8, escape it.
  You could:
  Add unistr/u8-check to bootstrap.conf (./bootstrap thereafter),
 include #include <unistr.h> and use
  if (u8_check (s, strlen(s)) == 0) to test for validity.
 
 Yes, I expected you to say something like this.
 
 My reason: I consider this escaping a very doubtful activity.
 In my eyes the correct code is not: always escape except when UTF-8,
 but rather: never escape except perhaps when someone asks for it.
 So the precise check for UTF-8 is in my eyes just bloat.

Of course, only when someone asks (in this special case).
But the user should *really* know what he is doing, else the requested 'not-
escaping' becomes an epic fail.

 Moreover: what to do if the name is not valid UTF-8?
 The current escaping produces something that is not valid UTF-8.
 So doing the current escaping is certainly a mistake, not better
 than using the name as-is. Invent a new type of escaping?

The procedure should be (simplified):
When extracting a URL from a document, we know its encoding. When we
generate a filename from this URL we should (and can) convert to the local
encoding first, then generate the filename. If this fails (likely an iconv()
problem), we fall back to escaping according to the user's wishes (unless
the user explicitly asked for no escaping).
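
[In outline, that procedure might look like the sketch below; every
helper name is assumed, for illustration only:]

/* Assumed helpers -- none of these are existing wget functions. */
extern char *convert_encoding (const char *s, const char *from, const char *to);
extern const char *local_encoding (void);
extern char *xstrdup (const char *s);
extern char *hex_escape (const char *s);
extern int opt_no_escaping;          /* user explicitly forbade escaping */

static char *
filename_from_url (const char *url_name, const char *doc_encoding)
{
  /* Try converting from the document's encoding to the local one first. */
  char *local = convert_encoding (url_name, doc_encoding, local_encoding ());

  if (local != NULL)
    return local;                    /* conversion succeeded */
  if (opt_no_escaping)
    return xstrdup (url_name);       /* keep the bytes as-is */
  return hex_escape (url_name);      /* fall back to %XX escaping */
}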

 So, for the time being, my previous patch avoided the old mistake,
 without introducing new mistakes :-).

OK. Let's set up a test where we define input and expected output.
If that works, I am fine.

Regards, Tim




Re: [Bug-wget] bad filenames (again)

2015-08-12 Thread Andries E. Brouwer
Hi Tim,

 Just a few questions.
 
 1.
 Why don't you use 'opt.locale' to check if the local encoding is UTF-8 ?

I thought that was usable only if ENABLE_IRI was defined.

 2. 
 I don't understand how you distinguish between illegal and legal UTF-8 
 sequences. I guess only legal sequences should be unescaped. 
 Or to make it easy: if the string is valid UTF-8, do not escape.
 If it is not valid UTF-8, escape it.
 You could:
 Add unistr/u8-check to bootstrap.conf (./bootstrap thereafter),
 add #include <unistr.h> and use
 if (u8_check (s, strlen (s)) == 0) to test for validity.
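
For reference, a minimal sketch of that check, assuming gnulib's
unistr/u8-check module (not wget code):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>
  #include <unistr.h>   /* gnulib: u8_check */

  /* u8_check returns NULL when S is valid UTF-8, otherwise a pointer
     to the first offending byte. */
  static bool
  is_valid_utf8 (const char *s)
  {
    return u8_check ((const uint8_t *) s, strlen (s)) == NULL;
  }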

Yes, I expected you to say something like this.

My reason: I consider this escaping a very doubtful activity.
In my eyes the correct code is not: always escape except when UTF-8,
but rather: never escape except perhaps when someone asks for it.
So the precise check for UTF-8 is in my eyes just bloat.

Moreover: what to do if the name is not valid UTF-8?
The current escaping produces something that is not valid UTF-8.
So doing the current escaping is certainly a mistake, not better
than using the name as-is. Invent a new type of escaping?

So, for the time being, my previous patch avoided the old mistake,
without introducing new mistakes :-).

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-12 Thread Andries E. Brouwer
On Wed, Aug 12, 2015 at 05:54:25PM +0200, Tim Ruehsen wrote:

 OK. Let's set up a test where we define input and expected output.
 If that works, I am fine.

OK. I mentioned a Hebrew example, but in order to avoid
the additional difficulty of bidi text, let me find a
Russian example instead.

% wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
Saving to: ‘Се\321%80д\321%86е’

% my_wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
Saving to: ‘Сердце’

(This is the Russian Wikipedia page for 'heart').

Andries


---

BTW - now that I tried this: the progress bar contains an ugly symbol.
Looking at progress.c I see

  int padding = MAX_FILENAME_COLS - orig_filename_cols;
  sprintf (p, "%s ", bp->f_download);
  p += orig_filename_cols + 1;
  for (; padding; padding--)
    *p++ = ' ';

but orig_filename_cols was computed correctly, counting character
positions, not bytes, and the
  p += orig_filename_cols + 1;
is a bug.
The ugly symbol is because a multibyte character was truncated.

If I write

  sprintf (p, "%s ", bp->f_download);
  p += strlen (bp->f_download) + 1;
  while (p < bp->buffer + MAX_FILENAME_COLS)
    *p++ = ' ';

instead, then the progress bar text looks right in this particular case.
I have not yet read the surrounding code.
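
A minimal standalone illustration of the columns-vs-bytes distinction,
assuming a UTF-8 locale (not wget code):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  int
  main (void)
  {
    const char *name = "Сердце";   /* 6 characters, 12 bytes in UTF-8 */
    setlocale (LC_ALL, "");

    /* Count display columns, the quantity orig_filename_cols holds. */
    mbstate_t st;
    memset (&st, 0, sizeof st);
    const char *p = name;
    size_t cols = 0, n;
    wchar_t wc;
    while ((n = mbrtowc (&wc, p, MB_CUR_MAX, &st)) != 0
           && n != (size_t) -1 && n != (size_t) -2)
      {
        int w = wcwidth (wc);
        cols += (w > 0 ? w : 0);
        p += n;
      }

    printf ("bytes=%zu columns=%zu\n", strlen (name), cols);  /* 12 vs. 6 */
    return 0;
  }

Advancing p by the column count (6) instead of the byte length (12) lands
in the middle of a UTF-8 sequence, hence the truncated character.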



Re: [Bug-wget] bad filenames (again)

2015-08-09 Thread Andries E. Brouwer
On Fri, Aug 07, 2015 at 05:13:19PM +0200, Tim Ruehsen wrote:

 The solution would be something like
 
 if locale is UTF-8
   do not escape valid UTF-8 sequences
 else
   keep wget's current behavior

 If you provide a patch for this, we will appreciate it.

OK - a first version of such a patch.
This splits the restrict_control into two halves.
The low control is as before.
The high control is permitted by default on a Unix system
with something that looks like a UTF-8 locale.
For Windows the behavior is unchanged.

Andries

Test: fetch http://he.wikipedia.org/wiki/ש._שפרה


diff -ru wget-1.16.3/src/init.c wget-1.16.3a/src/init.c
--- wget-1.16.3/src/init.c  2015-01-31 00:25:57.0 +0100
+++ wget-1.16.3a/src/init.c 2015-08-09 21:44:54.260215105 +0200
@@ -333,6 +333,30 @@
   return -1;
 }
 
+
+/* Used to determine whether bytes 128-159 are OK in a filename */
+static int
+have_utf8_locale (void)
+{
+#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
+  /* insert some test for Windows */
+#else
+  char *p;
+
+  p = getenv ("LC_ALL");
+  if (p == NULL)
+    p = getenv ("LC_CTYPE");
+  if (p == NULL)
+    p = getenv ("LANG");
+  if (p == NULL)
+    return false;
+  if (strstr (p, "UTF-8") != NULL || strstr (p, "UTF8") != NULL ||
+      strstr (p, "utf-8") != NULL || strstr (p, "utf8") != NULL)
+    return true;
+#endif
+  return false;
+}
+
 /* Reset the variables to default values.  */
 void
 defaults (void)
@@ -401,6 +422,7 @@
   opt.restrict_files_os = restrict_unix;
 #endif
   opt.restrict_files_ctrl = true;
+  opt.restrict_files_highctrl = (have_utf8_locale() ? false : true);
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;
 
@@ -1466,6 +1488,7 @@
 {
   int restrict_os = opt.restrict_files_os;
   int restrict_ctrl = opt.restrict_files_ctrl;
+  int restrict_highctrl = opt.restrict_files_highctrl;
   int restrict_case = opt.restrict_files_case;
   int restrict_nonascii = opt.restrict_files_nonascii;
 
@@ -1488,7 +1511,7 @@
   else if (VAL_IS (uppercase))
 restrict_case = restrict_uppercase;
   else if (VAL_IS (nocontrol))
-restrict_ctrl = false;
+restrict_ctrl = restrict_highctrl = false;
   else if (VAL_IS (ascii))
 restrict_nonascii = true;
   else
@@ -1509,6 +1532,7 @@
 
   opt.restrict_files_os = restrict_os;
   opt.restrict_files_ctrl = restrict_ctrl;
+  opt.restrict_files_highctrl = restrict_highctrl;
   opt.restrict_files_case = restrict_case;
   opt.restrict_files_nonascii = restrict_nonascii;
 
diff -ru wget-1.16.3/src/options.h wget-1.16.3a/src/options.h
--- wget-1.16.3/src/options.h   2015-01-31 00:25:57.0 +0100
+++ wget-1.16.3a/src/options.h  2015-08-09 21:22:35.984186065 +0200
@@ -244,6 +244,7 @@
   bool restrict_files_ctrl; /* non-zero if control chars in URLs
are restricted from appearing in
generated file names. */
+  bool restrict_files_highctrl; /* idem for bytes 128-159 */
   bool restrict_files_nonascii; /* non-zero if bytes with values greater
than 127 are restricted. */
   enum {
diff -ru wget-1.16.3/src/url.c wget-1.16.3a/src/url.c
--- wget-1.16.3/src/url.c   2015-02-23 16:10:22.0 +0100
+++ wget-1.16.3a/src/url.c  2015-08-09 21:14:34.876175626 +0200
@@ -1329,7 +1329,8 @@
 enum {
   filechr_not_unix= 1,  /* unusable on Unix, / and \0 */
   filechr_not_windows = 2,  /* unusable on Windows, one of \|/?:* */
-  filechr_control = 4   /* a control character, e.g. 0-31 */
+  filechr_control = 4,  /* a control character, e.g. 0-31 */
+  filechr_highcontrol = 8  /* a high control character, in 128-159 */
 };
 
 #define FILE_CHAR_TEST(c, mask) \
@@ -1340,6 +1341,7 @@
 #define U filechr_not_unix
 #define W filechr_not_windows
 #define C filechr_control
+#define Z filechr_highcontrol
 
 #define UW U|W
 #define UWC U|W|C
@@ -1370,8 +1372,8 @@
   0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   s    t   u   v   w   */
   0,  0,  0,  0,   W,  0,  0,  C,   /* x   y   z   {    |   }   ~   DEL */
 
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 128-143 */
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 144-159 */
+  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z, /* 128-143 */
+  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z,  Z, Z, Z, Z, /* 144-159 */
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
 
@@ -1383,6 +1385,7 @@
 #undef U
 #undef W
 #undef C
+#undef Z
 #undef UW
 #undef UWC
 
@@ -1417,8 +1420,11 @@
 mask = filechr_not_unix;
   else
 mask = filechr_not_windows;
+
   if (opt.restrict_files_ctrl)
 mask |= filechr_control;
+  if (opt.restrict_files_highctrl)
+mask |= filechr_highcontrol;
 
   /* Copy [b, e) to PATHEL and URL-unescape it. */
   if (escaped)




Re: [Bug-wget] bad filenames (again)

2015-08-07 Thread Tim Ruehsen
Hi Andries,

as I already mentioned, changing the default behavior of wget is not a good 
idea.

But I started a wget2 branch that produces wget and wget2 executables.
wget2's default behavior is to keep filenames as they are.

I am not sure how it compiles and works on Windows (Cygwin could work).
If you dare to check it out: any feedback is highly welcome.

Regards, Tim

On Thursday 06 August 2015 23:40:45 Andries E. Brouwer wrote:
 Today I again downloaded a large tree with wget and got only unusable
 filenames. Fortunately I have the utility wgetfix that repairs the
 consequences of this bug (see
 http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this
 wget bug should be fixed.
 
 (Maybe it has been fixed already? I looked at this in detail last year,
 and there was some correspondence but I think nothing happened.
 Have not looked at the latest sources.)
 
 What happens is that wget under certain circumstances escapes
 certain bytes in a filename. I think that this was always a mistake,
 but it did not occur very much and was defensible: filenames with
 embedded control characters are a pain.
 
 Today the situation is just the opposite: when copying from a remote
 utf8 system to a local utf8 system, correct and normal filenames
 are escaped into illegal filenames that cannot be used
 and are worse than a pain: one cannot do much else than discard them.
 
 What can the user do?
 
 If she is on Windows, she is told to switch to Linux:
  I can't help Windows users, but Wget is a power-user tool.
  And a Windows power-user should be able to start a virtual
  machine with Linux running to use tools like Wget.
 
 If she is on Linux, the easiest is to discard all that was downloaded
 and start over again, this time with the option
 --restrict-file-names=nocontrol
 
 If the user knows about wgetfix, that is an alternative.
 
 One can also use curl instead of wget.
 
 See also
 
 http://savannah.gnu.org/bugs/?37564
 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
 http://stackoverflow.com/questions/27054765/wget-japanese-characters
 http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
 http://www.win.tue.nl/~aeb/linux/misc/wget.html
 
 Below I suggested an easy fix, and discussed some details.
 
 Andries
 
 On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
  On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
   On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
   If I ask wget to download the wikipedia page
   
   http://he.wikipedia.org/wiki/ש._שפרה
   
   then I hope for a resulting file ש._שפרה.
   Instead, wget gives me ש._שפר\327%94, where the \327
   is an unpronounceable byte that cannot be typed.
   (This is a UTF-8 system and the filename
   that wget produces is not valid UTF-8.)
   
   Maybe it would be better if wget by default used the original filename.
   This name mangling is a vestige of old times, it seems to me.
   
   This is a commonly reported grievance and as you correctly mention a
   vestige of old times. With UTF-8 supported filesystems, Wget should
   simply write the correct characters.
   
   I sincerely hope this issue is resolved as fast as possible, but I
   know not how to. Those who understand i18n should work on this.
  
  It is very easy to resolve the issue, but I don't know how backwards
  compatible the wget developers want to be.
  
  The easiest solution is to change the line (in init.c:defaults())
  
  opt.restrict_files_ctrl = true;
  
  into
  
  opt.restrict_files_ctrl = false;
  
  That is what I would like to see:
  the default should be to preserve the name as-is,
  and there should be options escape_control or so
  to force the current default behaviour.
  
  There are also more complicated solutions.
  One can ask for LC_CTYPE or LANG or some such thing,
  and try to find out whether the current system is UTF-8,
  and only in that case set restrict_files_ctrl to false.
  
  I don't know anything about the Windows environment.
  
  Andries
  
  
  [Discussion:
  
  There is a flag --restrict-file-names. The manual page says
  "By default, Wget escapes the characters that are not valid or safe
   as part of file names on your operating system, as well as control
   characters that are typically unprintable."
  
  Presently this is false: On a UTF-8 system Wget by default introduces
  illegal characters. The option nocontrol is needed to preserve the
  correct name.
  
  The flag is handled in init.c:cmd_spec_restrict_file_names()
  where opt.restrict_files_{os,case,ctrl,nonascii} are set.
  Of interest is the restrict_files_ctrl flag.
  Today init.c does by default:
  
  #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
    opt.restrict_files_os = restrict_windows;
  #else
    opt.restrict_files_os = restrict_unix;
  #endif
    opt.restrict_files_ctrl = true;
    opt.restrict_files_nonascii = false;

Re: [Bug-wget] bad filenames (again)

2015-08-07 Thread Tim Ruehsen
On Friday 07 August 2015 16:38:01 Andries E. Brouwer wrote:
 On Fri, Aug 07, 2015 at 04:14:45PM +0200, Tim Ruehsen wrote:
  Hi Andries,
  
  as I already mentioned, changing the default behavior of wget is not a
  good
  idea.
  
  But I started a wget2 branch that produces wget and wget2 executables.
  wget2's default behavior is to keep filenames as they are.
  
  I am not sure how it compiles and works on Windows (Cygwin could work).
  If you dare to check it out: any feedback is highly welcome.
  
  Regards, Tim
 
 Hi Tim,
 
 I disagree. This is just a bug.
 Nobody wants illegal filenames.
 Even removing them is not entirely trivial since the filenames
 produced by wget are not legal character sequences, so cannot be typed.

Hi Andries,

obviously I got it wrong.

If it's a bug, let's just fix it (without breaking compatibility).

I don't have the time to read *all* the old emails right now.
But as far as I understand escaping occurs within legal UTF-8 sequences - and 
you are right when saying this is a bug when we have a UTF-8 locale.

The solution would be something like

if locale is UTF-8
  do not escape valid UTF-8 sequences
else
  keep wget's current behavior
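
In code, that decision might look like this (a sketch only; is_utf8_locale()
stands in for whatever locale test we settle on, and u8_check is the gnulib
routine mentioned elsewhere in this thread):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>
  #include <unistr.h>           /* gnulib: u8_check */

  extern bool is_utf8_locale (void);   /* hypothetical */

  /* True if NAME can be kept as-is, false if current escaping applies. */
  static bool
  keep_name_unescaped (const char *name)
  {
    if (is_utf8_locale ())
      /* do not escape valid UTF-8 sequences */
      return u8_check ((const uint8_t *) name, strlen (name)) == NULL;
    /* else keep wget's current behavior */
    return false;
  }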

If URLs (and thus filenames) are not in UTF-8, Wget will convert them to UTF-8 
before the above procedure (I guess that is what wget does anyways, well not 
100% sure).

Would you agree ?

If you provide a patch for this, we will appreciate it.

 I am a Linux man, no Windows computers here. So, I am happy to do
 stuff on Linux, but cannot test on Windows.

Sorry, won't bother you again regarding Windows ;-)

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-07 Thread Andries E. Brouwer
On Fri, Aug 07, 2015 at 04:14:45PM +0200, Tim Ruehsen wrote:
 Hi Andries,
 
 as I already mentioned, changing the default behavior of wget is not a good 
 idea.
 
 But I started a wget2 branch that produces wget and wget2 executables.
 wget2's default behavior is to keep filenames as they are.
 
 I am not sure how it compiles and works on Windows (Cygwin could work).
 If you dare to check it out: any feedback is highly welcome.
 
 Regards, Tim

Hi Tim,

I disagree. This is just a bug.
Nobody wants illegal filenames.
Even removing them is not entirely trivial since the filenames
produced by wget are not legal character sequences, so cannot be typed.

So, I think this should be fixed, for example with my one-liner fix,
but I am quite happy to do something more complicated if that is
what people prefer.

I am a Linux man, no Windows computers here. So, I am happy to do
stuff on Linux, but cannot test on Windows.

Andries



Re: [Bug-wget] bad filenames (again)

2015-08-06 Thread Andries E. Brouwer
Today I again downloaded a large tree with wget and got only unusable filenames.
Fortunately I have the utility wgetfix that repairs the consequences
of this bug (see http://www.win.tue.nl/~aeb/linux/misc/wget.html ),
but nevertheless this wget bug should be fixed.

(Maybe it has been fixed already? I looked at this in detail last year,
and there was some correspondence but I think nothing happened.
Have not looked at the latest sources.)

What happens is that wget under certain circumstances escapes
certain bytes in a filename. I think that this was always a mistake,
but it did not occur very much and was defensible: filenames with
embedded control characters are a pain.

Today the situation is just the opposite: when copying from a remote
utf8 system to a local utf8 system, correct and normal filenames
are escaped into illegal filenames that cannot be used
and are worse than a pain: one cannot do much else than discard them.

What can the user do?
If she is on Windows, she is told to switch to Linux:

 I can't help Windows users, but Wget is a power-user tool. 
 And a Windows power-user should be able to start a virtual 
 machine with Linux running to use tools like Wget. 

If she is on Linux, the easiest is to discard all that was downloaded
and start over again, this time with the option
--restrict-file-names=nocontrol

If the user knows about wgetfix, that is an alternative.

One can also use curl instead of wget.

See also

http://savannah.gnu.org/bugs/?37564
http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
http://stackoverflow.com/questions/27054765/wget-japanese-characters
http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
http://www.win.tue.nl/~aeb/linux/misc/wget.html

Below I suggested an easy fix, and discussed some details.

Andries



On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
 On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
  On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
 
  If I ask wget to download the wikipedia page
 
  http://he.wikipedia.org/wiki/ש._שפרה
 
  then I hope for a resulting file ש._שפרה.
  Instead, wget gives me ש._שפר\327%94, where the \327
  is an unpronounceable byte that cannot be typed.
  (This is a UTF-8 system and the filename
  that wget produces is not valid UTF-8.)
 
  Maybe it would be better if wget by default used the original filename.
  This name mangling is a vestige of old times, it seems to me.
  
  This is a commonly reported grievance and as you correctly mention a
  vestige of old times. With UTF-8 supported filesystems, Wget should
  simply write the correct characters.
  
  I sincerely hope this issue is resolved as fast as possible, but I
  know not how to. Those who understand i18n should work on this.
 
 It is very easy to resolve the issue, but I don't know how backwards
 compatible the wget developers want to be.
 
 The easiest solution is to change the line (in init.c:defaults())
   opt.restrict_files_ctrl = true;
 into
   opt.restrict_files_ctrl = false;
 
 That is what I would like to see:
 the default should be to preserve the name as-is,
 and there should be options escape_control or so
 to force the current default behaviour.
 
 There are also more complicated solutions.
 One can ask for LC_CTYPE or LANG or some such thing,
 and try to find out whether the current system is UTF-8,
 and only in that case set restrict_files_ctrl to false.
 
 I don't know anything about the Windows environment.
 
 Andries
 
 
 [Discussion:
 
 There is a flag --restrict-file-names. The manual page says
 "By default, Wget escapes the characters that are not valid or safe
  as part of file names on your operating system, as well as control
  characters that are typically unprintable."
 Presently this is false: On a UTF-8 system Wget by default introduces
 illegal characters. The option nocontrol is needed to preserve the
 correct name.
 
 The flag is handled in init.c:cmd_spec_restrict_file_names()
 where opt.restrict_files_{os,case,ctrl,nonascii} are set.
 Of interest is the restrict_files_ctrl flag.
 Today init.c does by default:
 
 #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
   opt.restrict_files_os = restrict_windows;
 #else
   opt.restrict_files_os = restrict_unix;
 #endif
   opt.restrict_files_ctrl = true;
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;
 
 The value of these flags is used in url.c:append_uri_pathel
 where FILE_CHAR_TEST (*p, mask) is used to decide what bytes
 in the filename need quoting.
 
 This is too simplistic an approach: quoting is introduced
 in the middle of multibyte characters. So the current setup
 is buggy and wrong. Basically the choice is between making
 the unfortunately named nocontrol (it should be called
 preserve_name or so) the default and adding more heuristics
 to detect and solve the worst problems. For example,
 UTF-8 is easy to