Re: [Bug-wget] bad filenames (again)
On Mon, Aug 24, 2015 at 03:44:09PM +0200, Tim Ruehsen wrote:

> Just implemented (or let's say fixed) Content-Disposition in wget2.
> It now saves the file as 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Good!

> Content-Disposition (filename, filename*) is standardized, but browsers
> seem to behave/parse very differently, ignoring standards.

Yes. A general phenomenon on the web is that non-specialists create websites. They know nothing about standards, but fiddle until it works (say, with IE6). Also, Microsoft does/did not respect standards. A consequence is that practice is more important than theory. One has to aim for robust solutions. I prefer to base the decision about what to do on the form of the filename (ASCII / UTF-8 / other), not on the headers encountered on the way to this file.

> I guess we can find an easy agreement.
>
> 1. Wget has to obey the defaults. If it fails or we find a well-known
> misbehavior (server/document fault), handle it automatically.
> That's how we try to do it now.
>
> 2. If still a problem arises, the user should be able to intercept,
> using special command line options for fine-tuning Wget's behavior.

Yes, whatever the user says, we do; the case where options have been given is unproblematic. That leaves your point 1. I am not sure what you think the defaults are.

My basic example is the %-encoded pure ASCII URL, referring to a non-text object. How should wget save the object? There is zero charset information. My answer today (after conversation with Eli) is: decode the %-encoded string. The last part is the suggested filename. If it is ASCII, use that ASCII name (where valid for the OS). If it is UTF-8 (but not ASCII), use it when the locale is UTF-8, otherwise convert (if possible) or escape. If it is not UTF-8, escape. [That is, I would recognize only what is easy to recognize, and prefer not to rely on any headers. Prefer not to convert except possibly in the UTF-8 case.]

How does your answer differ? Some ancient docs say that ISO-8859-1 is a default.
Even if such docs have not yet been replaced, I feel that they no longer reflect current practice. ISO-8859-x is dying. All the web should converge to Unicode, whatever that may be.

The relevant example might be

  http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

I have the impression that you are happy with knäckebröd.jpg, but I would be unhappy with that (although it happens to be correct), since guessing and conversion are involved. Guessing may not be so bad, but guessing and converting is terrible: it can be really complicated to retrieve the original filename after an incorrect conversion.

Andries

Another URL:

  http://hongaarskinderplezier.eu/index.php?pagina=96&naam=Gy%25F5r-Moson-Sopron

This is about holidays near the beautiful city of Győr in Hungary. But what happened with the city? Its name was written in ISO-8859-2, using 0xf5, and that was %-escaped to %f5, and that was again %-escaped to %25f5. I would undo the %-escape and see pure ASCII, and save as index.php?pagina=96&naam=Gy%F5r-Moson-Sopron. What would you do? (The page has meta charset=ISO-8859-2; the headers have Content-Type: text/html without charset information.)

---

Similarly

  http://www.matklubben.se/recept/lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l

has the %-encoded version of lchf kn%e4ckebr%f6d mandelmj%f6l, which again encodes the ISO-8859-1 version of lchf knäckebröd mandelmjöl. Such double encodings are not uncommon. But as a first approximation I think wget should not try to recognize them.

---

  http://www.eet-china.com/SEARCH/ART/%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1%E7%9A%84%EF%BC%85C0.HTM

ends in %C0%B6的%D1的%C0.HTM - this is a %-encoding using fat %-signs (U+FF05). You see that one can encounter all levels of messiness.
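The double escaping in the Győr example can be illustrated with a small decoder. This is only an illustrative sketch (`percent_decode` and `hexval` are made-up names, not Wget functions): applying it once to Gy%25F5r yields the still-pure-ASCII Gy%F5r, and only a second pass would produce the ISO-8859-2 byte 0xF5 - which is exactly why automatically recognizing double encodings is risky.

```c
#include <stddef.h>

/* Map a hex digit to its value, or -1 for a non-hex character. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode %XX escapes in 'in' into the NUL-terminated buffer 'out'.
   Bytes other than well-formed %XX triples are copied as-is.
   Returns the number of bytes written, or -1 if 'out' (of size
   outsz) is too small. */
int percent_decode(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    while (*in) {
        int byte;
        if (*in == '%' && hexval(in[1]) >= 0 && hexval(in[2]) >= 0) {
            byte = hexval(in[1]) * 16 + hexval(in[2]);
            in += 3;
        } else {
            byte = (unsigned char)*in++;
        }
        if (n + 1 >= outsz)
            return -1;
        out[n++] = (char)byte;
    }
    out[n] = '\0';
    return (int)n;
}
```

One decoding step keeps the name ASCII; a second step is what would be needed (and what wget arguably should not guess at) to reach the raw ISO-8859-2 byte.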
Re: [Bug-wget] bad filenames (again)
On Saturday 22 August 2015 00:39:01 Andries E. Brouwer wrote:

> On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:
>>> Content-Disposition: attachment;
>>> filename=20101202_%EB...%A8-%EB%B0%B1_.sgf
>>> This encodes a valid utf-8 filename, and that name should be used.
>>> So wget should save this file under the name
>>> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
>> This is a different issue. Here we are talking about the encoding of
>> HTTP headers, especially 'filename' values within the
>> Content-Disposition HTTP header. Wget simply does not parse this
>> correctly - it is just not coded in. It is just Wget missing some
>> code here (worth opening a separate bug).
> Good, saved for later.

Just implemented (or let's say fixed) Content-Disposition in wget2. It now saves the file as 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Content-Disposition (filename, filename*) is standardized, but browsers seem to behave/parse very differently, ignoring standards. See http://stackoverflow.com/questions/93551/how-to-encode-the-filename-parameter-of-content-disposition-header-in-http (answer 2 from Martin Ørding-Thomsen). But that's just FYI. Different issue.

>> If the server AND the document do not explicitly specify the
>> character encoding, there still is one - namely the default. Has been
>> ISO-8859-1 a while ago. AFAIR, HTML5 might have changed that (too
>> late for me now to look it up).
> Yes - that is our main difference. You read the standard and find
> there what everyone is supposed to do, or what the default is. I
> download stuff from the net and encounter lots of things people do,
> that are perhaps not according to the most recent standard, and may
> differ from the default. As a consequence I prefer to base the
> decision about what to do on the form of the filename (ASCII / UTF-8 /
> other), not on the headers encountered on the way to this file.

I guess we can find an easy agreement.

1. Wget has to obey the defaults. If it fails or we find a well-known misbehavior (server/document fault), handle it automatically.
That's how we try to do it now.

2. If still a problem arises, the user should be able to intercept, using special command line options for fine-tuning Wget's behavior.

Of course we try our best, so that 2. is normally not necessary. You already gave some examples, one of which (the Content-Disposition example) already led to an optimization (I'll transfer the code to Wget1.x soon). The other two obeyed the standards (one had f*cked up content, but that didn't touch Wget's functionality). I would ask you to give more examples of websites that you think aren't standard and/or where Wget has problems parsing out the links. That would be 50% of the work.

> (By the way, I checked my conjecture that iconv from UTF-8 to UTF-8
> need not be the identity map, and that is indeed the case. On my
> Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

We should have a 'shortcut', so if to-charset and from-charset are the same, we don't convert.

Tim
Re: [Bug-wget] bad filenames (again)
> Date: Sun, 23 Aug 2015 17:16:37 +0200
> From: Ángel González keis...@gmail.com
> CC: bug-wget@gnu.org
>
> On 23/08/15 16:47, Eli Zaretskii wrote:
>>>> Wrong. I can work with a larger one by using a UNC path.
>>> But then you will be unable to use relative file names, and will
>>> have to convert all the file names to the UNC format by hand, and
>>> any file names we create that exceed the 260-character limit will be
>>> almost unusable, since almost any program will be unable to
>>> read/write/delete/copy/whatever it. So this method is impractical,
>>> and it doesn't lift the limit anyway, see below.
> {{reference needed}}

For what part do you need a reference?

> I'm quite sure explorer will happily work with UNC paths, which means
> the user will be able to flawlessly move/copy/delete them.

No, the Explorer cannot handle file names longer than 260 characters. The Explorer uses shell APIs that are limited to 260 characters. Like I said: creating files whose names are longer than 260 characters is asking for trouble. You will need to write your own programs to manipulate such files.

> And actually, I think most programs will happily open (and read, edit,
> etc.) a file that was provided in UNC format.

UNC format is indeed supported by most (if not all) programs, but as soon as the file name is longer than 260 characters, all file-related APIs begin to fail.

>> * _Some_ Windows when using _some_ filesystems / apis have fixed
>> limits, but there are ways to produce larger paths...
> The issue here is not whether the size limits differ, the issue is
> whether the largest limit is still fixed. And it is, on Windows.
>> I had tried to skip over the specific details in my previous mail.
>> I didn't mean that the limit would be bigger, but that there isn't
>> one (that you can rely on, at least). On Windows 95/98 you had this
>> 260 character limit, and you currently still do depending on the API
>> you are using. But that's not a system limit any more.
This is wrong, and the URL I posted clearly describes the limitation: if you use UNCs, the size is still limited to 32K characters. So even if we want to convert every file name to the UNC \\?\x:\foo\bar form and create unusable files (which I don't recommend), the maximum length is still known in advance.

> Ok, it is possible that there *is* a limit of 32K characters.
> Still, it's not a practical one to hardcode.

Why not? Here's a simple code snippet that should work:

  int
  open_utf8 (const char *fn, int mode)
  {
    wchar_t fn_utf16[32*1024];
    int result = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS,
                                      fn, -1, fn_utf16, 32*1024);

    if (!result)
      {
        DWORD err = GetLastError ();

        switch (err)
          {
          case ERROR_INVALID_FLAGS:
          case ERROR_INVALID_PARAMETER:
            errno = EINVAL;
            break;
          case ERROR_INSUFFICIENT_BUFFER:
            errno = ENAMETOOLONG;
            break;
          case ERROR_NO_UNICODE_TRANSLATION:
          default:
            errno = ENOENT;
            break;
          }
        return -1;
      }
    return _wopen (fn_utf16, mode);
  }

> And we would be risking a stack overflow if attempting to create such
> a buffer on the stack.

The default stack size of Windows programs is 2MB, so I think we are safe using 64K here.
Re: [Bug-wget] bad filenames (again)
On 20/08/15 04:42, Eli Zaretskii wrote:
>> From: Ángel González
>> On 19/08/15 16:38, Eli Zaretskii wrote:
>>>> Indeed. Actually, there's no need to allocate memory dynamically,
>>>> neither with malloc nor with alloca, since Windows file names have
>>>> a fixed size limitation that is known in advance. So each
>>>> conversion function can use a fixed-sized local wchar_t array.
>>>> Doing that will also avoid the need for 2 calls to
>>>> MultiByteToWideChar, the first one to find out how much space to
>>>> allocate.
>> Nope. These functions would receive full path names, so there's no
>> maximum length.*
> Please see the URL I mentioned earlier in this thread: _all_ Windows
> file-related APIs are limited to 260 characters, including the drive
> letter and all the leading directories.

Wrong. I can work with a larger one by using a UNC path.

>> * _Some_ Windows when using _some_ filesystems / apis have fixed
>> limits, but there are ways to produce larger paths...
> The issue here is not whether the size limits differ, the issue is
> whether the largest limit is still fixed. And it is, on Windows.

I had tried to skip over the specific details in my previous mail. I didn't mean that the limit would be bigger, but that there isn't one (that you can rely on, at least). On Windows 95/98 you had this 260 character limit, and you currently still do depending on the API you are using. But that's not a system limit any more.
Re: [Bug-wget] bad filenames (again)
> Date: Sun, 23 Aug 2015 16:15:04 +0200
> From: Ángel González keis...@gmail.com
> CC: bug-wget@gnu.org
>
> On 20/08/15 04:42, Eli Zaretskii wrote:
>>>> On 19/08/15 16:38, Eli Zaretskii wrote:
>>>>> Indeed. Actually, there's no need to allocate memory dynamically,
>>>>> neither with malloc nor with alloca, since Windows file names have
>>>>> a fixed size limitation that is known in advance. So each
>>>>> conversion function can use a fixed-sized local wchar_t array.
>>>>> Doing that will also avoid the need for 2 calls to
>>>>> MultiByteToWideChar, the first one to find out how much space to
>>>>> allocate.
>>> Nope. These functions would receive full path names, so there's no
>>> maximum length.*
>> Please see the URL I mentioned earlier in this thread: _all_ Windows
>> file-related APIs are limited to 260 characters, including the drive
>> letter and all the leading directories.
> Wrong. I can work with a larger one by using a UNC path.

But then you will be unable to use relative file names, and will have to convert all the file names to the UNC format by hand, and any file names we create that exceed the 260-character limit will be almost unusable, since almost any program will be unable to read/write/delete/copy/whatever it. So this method is impractical, and it doesn't lift the limit anyway, see below.

>>> * _Some_ Windows when using _some_ filesystems / apis have fixed
>>> limits, but there are ways to produce larger paths...
>> The issue here is not whether the size limits differ, the issue is
>> whether the largest limit is still fixed. And it is, on Windows.
> I had tried to skip over the specific details in my previous mail.
> I didn't mean that the limit would be bigger, but that there isn't one
> (that you can rely on, at least). On Windows 95/98 you had this 260
> character limit, and you currently still do depending on the API you
> are using. But that's not a system limit any more.

This is wrong, and the URL I posted clearly describes the limitation: if you use UNCs, the size is still limited to 32K characters. So even if we want to convert every file name to the UNC \\?\x:\foo\bar form and create unusable files (which I don't recommend), the maximum length is still known in advance.
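The \\?\ extended-length form mentioned above can be sketched in portable string terms. This is only an illustrative sketch (`to_unc_path` is an invented name); real code would additionally have to make the path absolute first, since the extended-length prefix bypasses Windows' path normalization ("." / ".." and forward-slash handling), which is one reason such names are awkward for other programs.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Turn an absolute DOS-style path like "C:\foo\bar" into the
   extended-length form "\\?\C:\foo\bar", which lifts the 260-character
   MAX_PATH limit (allowing up to ~32K characters). Paths already
   carrying the prefix are returned unchanged. Returns 0 on success,
   -1 for relative paths or insufficient buffer space. */
int to_unc_path(const char *path, char *out, size_t outsz)
{
    const char prefix[] = "\\\\?\\";          /* the four characters \\?\ */

    if (strncmp(path, prefix, 4) == 0) {      /* already extended-length */
        if (strlen(path) + 1 > outsz)
            return -1;
        strcpy(out, path);
        return 0;
    }
    /* only absolute drive paths ("X:\...") can be prefixed directly */
    if (!((path[0] >= 'A' && path[0] <= 'Z') ||
          (path[0] >= 'a' && path[0] <= 'z')) ||
        path[1] != ':' || path[2] != '\\')
        return -1;
    if (snprintf(out, outsz, "%s%s", prefix, path) >= (int)outsz)
        return -1;
    return 0;
}
```

A relative path is rejected here rather than silently prefixed, mirroring the point above that the \\?\ form and relative names do not mix.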
Re: [Bug-wget] bad filenames (again)
On 23/08/15 16:47, Eli Zaretskii wrote:
>> Wrong. I can work with a larger one by using a UNC path.
> But then you will be unable to use relative file names, and will have
> to convert all the file names to the UNC format by hand, and any file
> names we create that exceed the 260-character limit will be almost
> unusable, since almost any program will be unable to
> read/write/delete/copy/whatever it. So this method is impractical, and
> it doesn't lift the limit anyway, see below.

{{reference needed}}

I'm quite sure explorer will happily work with UNC paths, which means the user will be able to flawlessly move/copy/delete them. And actually, I think most programs will happily open (and read, edit, etc.) a file that was provided in UNC format.

>> * _Some_ Windows when using _some_ filesystems / apis have fixed
>> limits, but there are ways to produce larger paths...
> The issue here is not whether the size limits differ, the issue is
> whether the largest limit is still fixed. And it is, on Windows.
>> I had tried to skip over the specific details in my previous mail.
>> I didn't mean that the limit would be bigger, but that there isn't
>> one (that you can rely on, at least). On Windows 95/98 you had this
>> 260 character limit, and you currently still do depending on the API
>> you are using. But that's not a system limit any more.
> This is wrong, and the URL I posted clearly describes the limitation:
> if you use UNCs, the size is still limited to 32K characters. So even
> if we want to convert every file name to the UNC \\?\x:\foo\bar form
> and create unusable files (which I don't recommend), the maximum
> length is still known in advance.

Ok, it is possible that there *is* a limit of 32K characters. Still, it's not a practical one to hardcode. And we would be risking a stack overflow if attempting to create such a buffer on the stack.
Re: [Bug-wget] bad filenames (again)
On Fri, Aug 21, 2015 at 12:07:56PM +0200, Tim Ruehsen wrote:

> The charset is *not* determined (guessed) from the URL string, be it
> hex encoded or not. We take the locale setup as default, but it can be
> overridden by --local-encoding. Right now, Wget does not have the
> ability to have different encodings for file input (--input-file) and
> input via STDIN (when used at the same time). But that is another
> issue...

It seems to me that I keep saying the same thing. We are not communicating. You talk about locale and local-encoding, but that is not the point.

There is a remote site. Nothing is known about this remote site. Certainly there is no reason to assume that it uses a character set that is related to the local setup of the machine here that runs wget. Since nothing is known about this remote site, it is impossible to know the character set (if any) of the filenames. And hence it is impossible to invoke iconv, since iconv requires a from-charset and a to-charset. Also, the user does not yet know what character set this remote site is using. And it might use more than one. So the user cannot in general give a --from-charset option.

In this situation: what do you do?

Andries
Re: [Bug-wget] bad filenames (again)
On Friday 21 August 2015 13:00:34 Andries E. Brouwer wrote:

> On Fri, Aug 21, 2015 at 12:07:56PM +0200, Tim Ruehsen wrote:
>> The charset is *not* determined (guessed) from the URL string, be it
>> hex encoded or not. We take the locale setup as default, but it can
>> be overridden by --local-encoding. Right now, Wget does not have the
>> ability to have different encodings for file input (--input-file) and
>> input via STDIN (when used at the same time). But that is another
>> issue...
> It seems to me that I keep saying the same thing.
> We are not communicating.

Yes, I am also under this impression :-(

> You talk about locale and local-encoding but that is not the point.

Sorry, exactly that seems to be the point.

> There is a remote site. Nothing is known about this remote site.

Wrong. Regarding HTTP(S), we know exactly the encoding of each downloaded HTML and CSS document (that's what I call 'remote encoding'). It is only these types of (downloaded) files we scan when going recursive. If the server (or document) states a wrong encoding (e.g. *saying* it has Japanese/EUC-JP encoding, but in fact it is iso-8859-1 encoded), we either have to use escaping or the user uses --remote-encoding to override the wrong server/document statement. But leaving these misconfigured servers aside as a special case, we are fine.

You might take a look at http://www.w3.org/TR/html4/charset.html#h-5.2.2 which describes how servers and clients should work regarding HTML character encoding (there should be something for CSS as well out there).

Andries, if you still have the impression that we are not communicating, I suggest that you make up a simple example test case to show your problem (and excuse me please for being kinda dumb/blind). Maybe two small HTML files with references to each other to demonstrate your point. (I can put them on my server and start wget/wget2 on them to see if it works or not.)

Regards, Tim
Re: [Bug-wget] bad filenames (again)
On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:

>> There is a remote site. Nothing is known about this remote site.
> Wrong. Regarding HTTP(S), we know exactly the encoding of each
> downloaded HTML and CSS document (that's what I call 'remote
> encoding').

You are an optimist. In my experience Firefox rarely gets it right. Let me find some random site. Say

  http://web2go.board19.com/gopro/go_view.php?id=12345

If I go there with Firefox, I get a go board with a lot of mojibake around it. Firefox took the encoding to be Unicode. Trying out what I have to say in the Text encoding menu, it turns out to be Chinese, Traditional.

> Leaving these misconfigured servers aside as a special case

But most of the East Asian servers I meet are misconfigured in this way. They announce text/html with charset utf-8 and come with some random charset. So trusting this announced charset should be done cautiously. And you say misconfigured servers, but often one gets a Unix or Windows file hierarchy, and several character sets occur. The server doesn't know. The sysadmin doesn't know. A university machine will have many users with files in several languages and character sets.

Moreover, the character set of a filename is in general unrelated to the character set of the contents of the file. That is most clear when the file is not a text file. What character set is the filename

  http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

in? You recognize ISO 8859-1 or similar. My local machine is on UTF-8. The HTTP headers say Content-Type: image/jpeg. How can wget guess?

Andries
Re: [Bug-wget] bad filenames (again)
On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:

> On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:
>>> There is a remote site. Nothing is known about this remote site.
>> Wrong. Regarding HTTP(S), we know exactly the encoding of each
>> downloaded HTML and CSS document (that's what I call 'remote
>> encoding').
> You are an optimist. In my experience Firefox rarely gets it right.
> Let me find some random site. Say
> http://web2go.board19.com/gopro/go_view.php?id=12345

I try to be an optimist in all situations, yes :-)

> If I go there with Firefox, I get a go board with a lot of mojibake
> around it. Firefox took the encoding to be Unicode. Trying out what I
> have to say in the Text encoding menu, it turns out to be Chinese,
> Traditional.

The server tells us the document is UTF-8. The document tells us it is UTF-8. But then, some moron (there are a lot of these dudes doing webpage 'design') put non-UTF-8 text into the document. That is like putting plum pudding into a jar labeled 'strawberry jam'. What will you do? Go back and return it? Or accept it, saying 'uh oh, my strawberry allergy will bite me, but I am a tough guy'.

*BUT* that is not the point for wget, since wget doesn't mess around with the textual content (no conversion takes place). When used recursively, wget will extract URLs from the document - *not* from the text but from the HTML tags/attributes. And *surprise*, all of the links in the document are UTF-8 / ASCII (no browser in the world would expect anything else). And all that matters are the URLs from the HTML attributes.

> And you say misconfigured servers, but often one gets a Unix or
> Windows file hierarchy, and several character sets occur. The server
> doesn't know. The sysadmin doesn't know. A university machine will
> have many users with files in several languages and character sets.

Trust them, they know. If not, their web site will be heavily broken. But there is nothing to fix for us.

> Moreover, the character set of a filename is in general unrelated to
> the character set of the contents of the file. That is most clear when
> the file is not a text file. What character set is the filename
> http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file. Content doesn't matter to wget. Apart from that, if you want to download the above mentioned web page and you have a UTF-8 locale, you have to tell wget via --local-encoding what encoding the URL is. But if wget --recursive finds the above URL within an HTML attribute, you won't need --local-encoding. By the measures taken from http://www.w3.org/TR/html4/charset.html#h-5.2.2, wget will know the correct encoding and will just do the right thing (after the currently discussed change regarding charsets / file naming). Wget2 already does it.

  $ wget --local-encoding=iso-8859-1 'http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg'
  --2015-08-21 16:30:05--  http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg
  Resolving www.win.tue.nl (www.win.tue.nl)... 131.155.0.177
  Connecting to www.win.tue.nl (www.win.tue.nl)|131.155.0.177|:80... connected.
  HTTP request sent, awaiting response... 404 Not Found
  2015-08-21 16:30:05 ERROR 404: Not Found.

  --2015-08-21 16:30:05--  http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
  Reusing existing connection to www.win.tue.nl:80.
  HTTP request sent, awaiting response... 200 OK
  Length: 11690 (11K) [image/jpeg]
  Saving to: ‘knäckebröd.jpg’

  knäckebröd.jp 100%[=] 11.42K --.-KB/s in 0.002s

  2015-08-21 16:30:05 (6.83 MB/s) - ‘knäckebröd.jpg’ saved [11690/11690]

(Old wget having the progress bar bug.)

Tim
Re: [Bug-wget] bad filenames (again)
On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:

> On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
>> Let me find some random site. Say
>> http://web2go.board19.com/gopro/go_view.php?id=12345
> The server tells us the document is UTF-8. The document tells us it is
> UTF-8.

And it is not. So - this example establishes that remote character set information, when present, is often unreliable. Let me add one more example,

  http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html

a famous Danish recipe. The headers say Content-Type: text/html without revealing any character set.

>> Moreover, the character set of a filename is in general unrelated to
>> the character set of the contents of the file. That is most clear
>> when the file is not a text file. What character set is the filename
>> http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
> Wrong question. It is a JPEG file. Content doesn't matter to wget.

Hmm. I thought the topic of our discussion was filenames and character sets. Here is a file, and its name is in ISO 8859-1. When wget saves it, what will the filename be?

> If you want to download the above mentioned web page and you have a
> UTF-8 locale, you have to tell wget via --local-encoding what encoding
> the URL is.

Are you sure you do not mean --remote-encoding? But whatever you mean, it is an additional option. If the wget user already knows the character set, she can of course tell wget. The discussion is about the situation where the user does not know. So, that is the situation we are discussing: a remote site, the user does not know what encoding is used (she will find out after downloading), and the headers have either no information or wrong information. Now if one invokes iconv it is likely that garbage will be the result.

Andries

Here is a Korean example.

  http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7

The http headers say Content-Type: text/plain; charset=iso-8859-1 (which is incorrect), an internal header says that this is ISO-2022-KR (which is also incorrect); in fact the content is in EUC-KR. That is none of wget's business, we want to save this file. The headers say

  Content-Disposition: attachment; filename=20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf

This encodes a valid utf-8 filename, and that name should be used. So wget should save this file under the name 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
Re: [Bug-wget] bad filenames (again)
On Friday, 21 August 2015, 17:28:09 Andries E. Brouwer wrote:

> On Fri, Aug 21, 2015 at 04:34:36PM +0200, Tim Ruehsen wrote:
>> On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
>>> Let me find some random site. Say
>>> http://web2go.board19.com/gopro/go_view.php?id=12345
>> The server tells us the document is UTF-8. The document tells us it
>> is UTF-8.
> And it is not. So - this example establishes that remote character set
> information, when present, is often unreliable. Let me add one more
> example,
> http://www.win.tue.nl/~aeb/linux/lk/r%f8dgr%f8d.html
> a famous Danish recipe. The headers say Content-Type: text/html
> without revealing any character set.

1. There is no URL to parse in this document, so encoding does not matter anyway.

2. If the server AND the document do not explicitly specify the character encoding, there still is one - namely the default. It has been ISO-8859-1 a while ago. AFAIR, HTML5 might have changed that (too late for me now to look it up).

There is a good diagram - maybe not perfectly up-to-date, but it still shows roughly how to operate: http://nikitathespider.com/articles/EncodingDivination.html

>>> Moreover, the character set of a filename is in general unrelated to
>>> the character set of the contents of the file. That is most clear
>>> when the file is not a text file. What character set is the filename
>>> http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
>> Wrong question. It is a JPEG file. Content doesn't matter to wget.
> Hmm. I thought the topic of our discussion was filenames and character
> sets. Here is a file, and its name is in ISO 8859-1. When wget saves
> it, what will the filename be?

>> If you want to download the above mentioned web page and you have a
>> UTF-8 locale, you have to tell wget via --local-encoding what
>> encoding the URL is.
> Are you sure you do not mean --remote-encoding?

Yes, I am sure.
Here are my tests (my locale is UTF-8):

Wrong:

  $ wget -nv --remote-encoding=iso-8859-1 http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
  2015-08-21 20:09:30 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg [11690/11690] -> kn�ckebr�d.jpg.1 [1]

Right:

  http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg:
  2015-08-21 20:14:18 FEHLER 404: Not Found.
  2015-08-21 20:14:18 URL:http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg [11690/11690] -> knäckebröd.jpg [1]

> But whatever you mean, it is an additional option. If the wget user
> already knows the character set, she can of course tell wget. The
> discussion is about the situation where the user does not know. So,
> that is the situation we are discussing: a remote site, the user does
> not know what encoding is used (she will find out after downloading),
> and the headers have either no information or wrong information. Now
> if one invokes iconv it is likely that garbage will be the result.

> Here is a Korean example.
> http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
> The http headers say Content-Type: text/plain; charset=iso-8859-1
> (which is incorrect), an internal header says that this is ISO-2022-KR
> (which is also incorrect); in fact the content is in EUC-KR. That is
> none of wget's business, we want to save this file. The headers say
> Content-Disposition: attachment;
> filename=20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf
> This encodes a valid utf-8 filename, and that name should be used. So
> wget should save this file under the name
> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

This is a different issue. Here we are talking about the encoding of HTTP headers, especially 'filename' values within the Content-Disposition HTTP header. The above is correctly encoded (UTF-8 percent encoding).
The encoding is described in RFC 5987 (Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters). Wget simply does not parse this correctly - it is just not coded in. That is why support for Content-Disposition in Wget is documented as 'experimental' (you have to explicitly enable it via --content-disposition). Again, the server encoding is known. Regarding filename encoding, nothing is wrong in your example. It is just Wget missing some code here (worth opening a separate bug).

Default Wget behavior:

  $ wget -nv http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
  2015-08-21 20:20:05 URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] -> 1847B5314CF754B83134B7 [1]

Enabled Content-Disposition support:

  $ wget -nv --content-disposition http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7
  2015-08-21 20:23:50 URL:http://cfile204.uf.daum.net/attach/1847B5314CF754B83134B7 [1441/1441] -> 20101202_%EB%86%8D%EC%8B%AC%EC%8B%A0%EB%9D%BC%EB%A9%B4%EB%B0%B0_%EB%B0%94%EB%91%91(%EB%8B%A4%EC%B9%B4%EC%98%A4%EC%8B%A0%EC%A7%809%EB%8B%A8-%EB%B0%B1_.sgf [1]

As we see, unescaping and UTF-8-to-locale conversion are missing here.
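The RFC 5987 ext-value syntax used by filename* (charset, optional language tag, and percent-encoded bytes, separated by single quotes) is simple to decode. A hedged sketch, with `parse_ext_value` being an invented name rather than Wget code; interpreting the decoded bytes in the declared charset is left to the caller:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Decode an RFC 5987 ext-value:
       filename*=charset'language'percent-encoded-bytes
   e.g. filename*=UTF-8''%e2%82%ac%20rates   ("€ rates", RFC 5987's example)
   Returns a malloc'd decoded byte string, or NULL on malformed input.
   The charset label (almost always UTF-8) is copied into 'charset',
   a caller-provided buffer of at least 32 bytes. */
char *parse_ext_value(const char *v, char *charset)
{
    const char *q1 = strchr(v, '\'');
    const char *q2 = q1 ? strchr(q1 + 1, '\'') : NULL;

    if (!q2 || (size_t)(q1 - v) >= 32)
        return NULL;                         /* missing quotes or charset too long */
    memcpy(charset, v, q1 - v);
    charset[q1 - v] = '\0';                  /* language tag between quotes is ignored */

    const char *p = q2 + 1;
    char *out = malloc(strlen(p) + 1), *o = out;
    if (!out)
        return NULL;
    while (*p) {
        if (*p == '%' && isxdigit((unsigned char)p[1])
                      && isxdigit((unsigned char)p[2])) {
            char hex[3] = { p[1], p[2], 0 };
            *o++ = (char)strtol(hex, NULL, 16);
            p += 3;
        } else {
            *o++ = *p++;
        }
    }
    *o = '\0';
    return out;
}
```

Applied to the Korean header above, this yields the raw UTF-8 bytes of 20101202_농심신라면배_... which, on a UTF-8 locale, could be used as the filename directly.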
Re: [Bug-wget] bad filenames (again)
On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:

>>> Content-Disposition: attachment;
>>> filename=20101202_%EB...%A8-%EB%B0%B1_.sgf
>>> This encodes a valid utf-8 filename, and that name should be used.
>>> So wget should save this file under the name
>>> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
> This is a different issue. Here we are talking about the encoding of
> HTTP headers, especially 'filename' values within the
> Content-Disposition HTTP header. Wget simply does not parse this
> correctly - it is just not coded in. It is just Wget missing some code
> here (worth opening a separate bug).

Good, saved for later.

> If the server AND the document do not explicitly specify the character
> encoding, there still is one - namely the default. Has been ISO-8859-1
> a while ago. AFAIR, HTML5 might have changed that (too late for me now
> to look it up).

Yes - that is our main difference. You read the standard and find there what everyone is supposed to do, or what the default is. I download stuff from the net and encounter lots of things people do, that are perhaps not according to the most recent standard, and may differ from the default. As a consequence I prefer to base the decision about what to do on the form of the filename (ASCII / UTF-8 / other), not on the headers encountered on the way to this file.

Fortunately, almost all URLs are in ASCII - no problem. Fortunately, almost all that are not in ASCII are UTF-8. The good thing about UTF-8 is that it has a quite typical bit pattern. A non-ASCII filename that is valid UTF-8 is very likely UTF-8. So one can recognize ASCII and UTF-8 rather reliably.

(By the way, I checked my conjecture that iconv from UTF-8 to UTF-8 need not be the identity map, and that is indeed the case. On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

Andries
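The "quite typical bit pattern" recognition described above can be sketched as follows. Names are illustrative (not Wget's), and this is deliberately not a complete UTF-8 validator (for instance, surrogate ranges are not rejected); it checks exactly the structure that makes accidental matches unlikely: a lead byte 110xxxxx / 1110xxxx / 11110xxx followed by the right number of 10xxxxxx continuation bytes.

```c
#include <stddef.h>

enum name_kind { NAME_ASCII, NAME_UTF8, NAME_OTHER };

/* Classify a byte string as pure ASCII, non-ASCII but valid UTF-8,
   or neither (e.g. ISO-8859-x, EUC-KR, ...). */
enum name_kind classify_name(const unsigned char *s, size_t len)
{
    int saw_non_ascii = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t cont;

        if (c < 0x80) { i++; continue; }              /* ASCII byte */
        saw_non_ascii = 1;
        if      ((c & 0xE0) == 0xC0 && c >= 0xC2) cont = 1;  /* reject overlong C0/C1 */
        else if ((c & 0xF0) == 0xE0)              cont = 2;
        else if ((c & 0xF8) == 0xF0 && c <= 0xF4) cont = 3;  /* reject > U+10FFFF */
        else return NAME_OTHER;                       /* invalid lead byte */
        if (i + cont >= len)
            return NAME_OTHER;                        /* truncated sequence */
        for (size_t k = 1; k <= cont; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return NAME_OTHER;                    /* bad continuation byte */
        i += cont + 1;
    }
    return saw_non_ascii ? NAME_UTF8 : NAME_ASCII;
}
```

On this scheme the ISO-8859-1 bytes of knäckebröd.jpg come out as NAME_OTHER (0xE4 would need two continuation bytes, but 'c' and 'k' follow), while the UTF-8 spelling comes out as NAME_UTF8, matching the claim that ASCII and UTF-8 are recognizable rather reliably.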
Re: [Bug-wget] bad filenames (again)
On Friday 21 August 2015 02:08:43 Andries E. Brouwer wrote: On Thu, Aug 20, 2015 at 10:47:35AM +0200, Tim Ruehsen wrote: Basically, I keep track of the charset of each URL input (command line, input file, stdin, downloaded+scanned). It seems to me, you can't. Consider for example a command line that gives a URL hex escaped. Now the command line is pure ASCII and gives no information at all about the character set of the filename. The charset is *not* determined (guessed) from the URL string, be it hex encoded or not. We take the locale setup as default, but it can be overridden by --local-encoding. Right now, Wget does not have the ability to have different encodings for file input (--input-file) and input via STDIN (when used at the same time). But that is another issue... Tim
Re: [Bug-wget] bad filenames (again)
On Wed, Aug 19, 2015 at 05:38:39PM +0300, Eli Zaretskii wrote:

Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7f), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.

I think this is simpler and produces the same results:
- if the user specified a from-charset, use that
- otherwise assume UTF-8

Simpler, but the results are not the same. If the from-charset is unknown, then any call of iconv will certainly lead to bad results. So there are only the two possibilities: (i) leave as-is (if that is the user's preference) (ii) make pure ASCII via hex escapes.

Andries
Re: [Bug-wget] bad filenames (again)
From: Tim Ruehsen tim.rueh...@gmx.de Cc: Andries E. Brouwer andries.brou...@cwi.nl Date: Thu, 20 Aug 2015 10:47:35 +0200 Tim says he has some/most of that coded on a branch, so I think we should start by merging that branch, and then take it from there. It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 'click on the merge button' to merge. Basically, I keep track of the charset of each URL input (command line, input file, stdin, downloaded+scanned). So when generating the filename we have the to and from charset. When iconv fails here (e.g. Chinese input, ASCII output), escaping takes place. Sounds good to me. Is something holding the merge of this to master?
Re: [Bug-wget] bad filenames (again)
On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote: OK, but how is this different from what we'd get using your suggested 4 alternatives? What can I reply? Just read my letter again. I think I said what I wanted to say. Andries
Re: [Bug-wget] bad filenames (again)
On Thursday 20 August 2015 17:39:09 Eli Zaretskii wrote: From: Tim Ruehsen tim.rueh...@gmx.de Cc: Andries E. Brouwer andries.brou...@cwi.nl Date: Thu, 20 Aug 2015 10:47:35 +0200 Tim says he has some/most of that coded on a branch, so I think we should start by merging that branch, and then take it from there. It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 'click on the merge button' to merge. Basically, I keep track of the charset of each URL input (command line, input file, stdin, downloaded+scanned). So when generating the filename we have the to and from charset. When iconv fails here (e.g. Chinese input, ASCII output), escaping takes place. Sounds good to me. Is something holding the merge of this to master? Sorry it should have been so you *can't* just 'click on the merge button' to merge :-) I have to do some more organizational stuff over there before I introduce an official alpha version (but it is working already with a bunch of new features). Tim
Re: [Bug-wget] bad filenames (again)
On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

OK, then let me explain my line of reasoning. Plain ASCII is valid UTF-8, and if converting with iconv assuming it's UTF-8 fails, you know it's not valid UTF-8. So the last 3 possibilities in your suggestion boil down to try converting as if it were UTF-8, and if that fails, you know it's Unknown.

Yes, although I would not invoke iconv to actually convert from UTF-8 to UTF-8. Unicode is a complicated beast, and it is not certain that conversion from UTF-8 to UTF-8 is the identity transformation. (For example, implementations may prefer either NFC or NFD. MacOS has its own NFD-like version for filenames.) But you are right, one can use it as a test.

After finding out that the charset is unknown I want to hex-encode the entire filename. On the other hand, if the appropriate thing is to invoke iconv to convert from one charset to another, I want to hex-encode only the failing bytes. The reason for this difference: (a) if there is reason to expect that conversion should be possible, for example because the user specified the from-charset as GB18030, and it fails, then it often fails only in a few isolated places where Microsoft extensions are used, and it is more user-friendly to do the conversion where possible; but (b) if nothing is known, then the character set can be a multibyte one like SJIS, where ASCII bytes occur as second halves of symbols, and not escaping such ASCII bytes is confusing and sometimes leads to strange problems.

Andries
Re: [Bug-wget] bad filenames (again)
On Wednesday 19 August 2015 17:38:39 Eli Zaretskii wrote:

Date: Wed, 19 Aug 2015 02:52:57 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: bug-wget@gnu.org

Look at the remote filename. Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7f), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.

I think this is simpler and produces the same results:
- if the user specified a from-charset, use that
- otherwise assume UTF-8

Determine a local character set as follows:
- if the user specified a to-charset, use that
- if the locale uses UTF-8, use that
- otherwise take ASCII

I suggest this instead:
- if the user specified a to-charset, use that
- otherwise, call nl_langinfo(CODESET) to find out the current locale's encoding

Convert the name from from-charset to to-charset:
- if the user asked for unmodified filenames, do nothing
- if the name is ASCII, do nothing
- if the name is UTF-8 and the locale uses UTF-8, do nothing
- convert from Unknown by hex-escaping the entire name
- convert to ASCII by hex-escaping the entire name
- otherwise invoke iconv(); upon failure, escape the illegal bytes

My suggestion:
- if the user asked for unmodified filenames, do nothing
- else invoke 'iconv' to convert from remote to local encoding
- if 'iconv' fails, convert to ASCII by hex-escaping

Hex-escaping only the bytes that fail 'iconv' is better than hex-escaping all of them, but it's more complex, and I'm not sure it's worth the hassle. But if it can be implemented without undue trouble, I'm all for it, as it will make wget more user-friendly in those cases.

Once we know what we want it is trivial to write the code, but it may take a while to figure out what we want. I think we should start applying the current patch.

Tim says he has some/most of that coded on a branch, so I think we should start by merging that branch, and then take it from there.
It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 'click on the merge button' to merge. Basically, I keep track of the charset of each URL input (command line, input file, stdin, downloaded+scanned). So when generating the filename we have the to and from charset. When iconv fails here (e.g. Chinese input, ASCII output), escaping takes place. Tim
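The nl_langinfo(CODESET) half of the suggestion above amounts to something like this (a sketch; "to_opt" stands in for a hypothetical --to-charset / --local-encoding option value):

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Pick the local charset for filename conversion: an explicit user
   option wins, otherwise ask the locale.  nl_langinfo(CODESET) returns
   e.g. "UTF-8" or "ANSI_X3.4-1968" depending on the current locale. */
static const char *
local_charset (const char *to_opt)
{
  if (to_opt)
    return to_opt;
  return nl_langinfo (CODESET);
}
```

Note that nl_langinfo reflects the locale set with setlocale(), so a program must call setlocale(LC_ALL, "") at startup for this to report the user's environment rather than the "C" locale.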
Re: [Bug-wget] bad filenames (again)
Date: Wed, 19 Aug 2015 02:52:57 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: bug-wget@gnu.org

Look at the remote filename. Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7f), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.

I think this is simpler and produces the same results:
- if the user specified a from-charset, use that
- otherwise assume UTF-8

Determine a local character set as follows:
- if the user specified a to-charset, use that
- if the locale uses UTF-8, use that
- otherwise take ASCII

I suggest this instead:
- if the user specified a to-charset, use that
- otherwise, call nl_langinfo(CODESET) to find out the current locale's encoding

Convert the name from from-charset to to-charset:
- if the user asked for unmodified filenames, do nothing
- if the name is ASCII, do nothing
- if the name is UTF-8 and the locale uses UTF-8, do nothing
- convert from Unknown by hex-escaping the entire name
- convert to ASCII by hex-escaping the entire name
- otherwise invoke iconv(); upon failure, escape the illegal bytes

My suggestion:
- if the user asked for unmodified filenames, do nothing
- else invoke 'iconv' to convert from remote to local encoding
- if 'iconv' fails, convert to ASCII by hex-escaping

Hex-escaping only the bytes that fail 'iconv' is better than hex-escaping all of them, but it's more complex, and I'm not sure it's worth the hassle. But if it can be implemented without undue trouble, I'm all for it, as it will make wget more user-friendly in those cases.

Once we know what we want it is trivial to write the code, but it may take a while to figure out what we want. I think we should start applying the current patch.

Tim says he has some/most of that coded on a branch, so I think we should start by merging that branch, and then take it from there.
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 22:28:21 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org

What is needed to have a full Unicode support in wget on Windows is to provide replacements for all the file-name related libc functions ('fopen', 'open', 'stat', 'access', etc.) which will accept file names encoded in UTF-8, convert them internally into UTF-16, and call the wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat', '_waccess', etc.) with the converted file name. Another thing that is needed is similar replacements for 'printf', 'puts', 'fprintf', etc. when they are used for writing file names to the console -- because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov.

Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html There we had a similar discussion, and he wrote mswindows.diff with

+int
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Something similar is what would be needed on Windows?

Yes, thanks for pointing out those patches. Any reasons they weren't accepted back then?

Is his patch usable?

It needs some minor polishing, but in general it should do the job, yes. I admit that I don't understand the need for the url.c patch. Why do we need to convert to wchar_t when the locale's codeset is already UTF-8? (I could understand that for non-UTF-8 locales, but the patch explicitly limits the conversion to wchar_t and back to UTF-8 locales, where the normal string functions should do the job.) Is this only for converting to upper/lower-case?
There's still the part with writing UTF-8 encoded file/URL names to the Windows console; that will have to be added.
Re: [Bug-wget] bad filenames (again)
Date: Wed, 19 Aug 2015 01:43:51 +0200 From: Ángel González keis...@gmail.com

+int
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Something similar is what would be needed on Windows? Is his patch usable?

That would probably work, but would need a review. On a quick look, some of the functions have memory leaks (seems he first used malloc, then changed only some of them to alloca).

Indeed. Actually, there's no need to allocate memory dynamically, neither with malloc nor with alloca, since Windows file names have a fixed size limitation that is known in advance. So each conversion function can use a fixed-sized local wchar_t array. Doing that will also avoid the need for 2 calls to MultiByteToWideChar, the first one to find out how much space to allocate.

And of course, there's the question of what to do if the filename we are trying to convert to utf-16 is not in fact valid utf-8.

The calls to MultiByteToWideChar should use a flag (MB_ERR_INVALID_CHARS) in their 2nd argument that makes the function fail with a distinct error code in that case. When it fails like that, the wc_* wrappers should simply call the normal unibyte functions with the original 'char *' argument. This makes the modified code fall back on previous behavior when the source file names are not in UTF-8. And regardless, wget should convert to the locale's codeset (on all platforms).
Once the above patches are accepted, the Windows build will pretend that its locale's codeset is UTF-8, and that will ensure the conversions with MultiByteToWideChar will work in most situations.
Re: [Bug-wget] bad filenames (again)
Date: Wed, 19 Aug 2015 20:50:55 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, keis...@gmail.com, bug-wget@gnu.org On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote: OK, but how is this different from what we'd get using your suggested 4 alternatives? What can I reply? Just read my letter again. I think I said what I wanted to say. OK, then let me explain my line of reasoning. Plain ASCII is valid UTF-8, and if converting with iconv assuming it's UTF-8 fails, you know it's not valid UTF-8. So the last 3 possibilities in your suggestion boil down to try converting as if it were UTF-8, and if that fails, you know it's Unknown.
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 11:58:54AM +0200, Tim Ruehsen wrote:

Unix filenames are sequences of bytes, they do not have a character set. The character encoding determines with what symbols these bytes (or multibyte sequences, i.e. codepoints) are displayed for you.

Sure. So each time I load a different font, I see different glyphs for my symbols. The file with single-byte name 0xff will look like a Dutch ligature ij in some fonts, and quite different in other fonts. The point is: it is the user's choice to load a font. (Or to set a locale.) The filenames themselves do not carry additional information about their character set. For historical reasons a single directory can have files with names in several character sets.

All this is about the local situation. One cannot know the character set of a filename because that concept does not exist in Unix. About the remote situation even less is known. It would be terrible if wget decided to use obscure heuristics to invent a remote character set and then invoke iconv.

Andries
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 10:29:40AM +0200, Tim Ruehsen wrote: I am going with Eli that we should use iconv. We know the remote encoding and the local encoding Do we? How do you guess the remote encoding? Is there any particular encoding? Unix filenames are sequences of bytes, they do not have a character set. Andries
Re: [Bug-wget] bad filenames (again)
On Monday 17 August 2015 22:51:12 Andries E. Brouwer wrote:

On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote: what do we want to achieve here, and why is what wget did before your patch the wrong thing?

Wget modified filenames, and users are unhappy. See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745 http://savannah.gnu.org/bugs/?37564 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors http://stackoverflow.com/questions/27054765/wget-japanese-characters http://www.win.tue.nl/~aeb/linux/misc/wget.html etc. It is debatable what precisely would be the right thing, but my patch greatly increases the number of happy users. Further improvement is possible. For example, nothing was changed yet for Windows, but also Windows users complain about this wget escaping.

I am going with Eli that we should use iconv. We know the remote encoding and the local encoding, so I don't see a problem here. There are a few cases (when using --input-file) where we have to tell wget the encoding via --remote-encoding.

On Windows we very often have the default locale Windows-1252 (aka CP1252), which is a superset of ISO-8859-1, while web servers more and more often deliver content encoded as UTF-8. A UTF-8 filename of 'ö.html' (\xC3\xB6.html) should be saved as the CP1252 ö.html (\xF6.html). If conversion is not possible due to characters not included in CP1252, we should fall back to escaping (as an improvement we could first try to convert codepoint by codepoint and escape only the ones that are not convertible).

This is already done in the 'wget2' branch, where it can be tested (using src2/wget2). We just have to backport it to the Wget 'master' branch. For me, this is just a matter of available time.

Tim
Re: [Bug-wget] bad filenames (again)
On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote: what do we want to achieve here, and why is what wget did before your patch the wrong thing? Wget modified filenames, and users are unhappy. See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745 http://savannah.gnu.org/bugs/?37564 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors http://stackoverflow.com/questions/27054765/wget-japanese-characters http://www.win.tue.nl/~aeb/linux/misc/wget.html etc. It is debatable what precisely would be the right thing, but my patch greatly increases the number of happy users. Further improvement is possible. For example, nothing was changed yet for Windows, but also Windows users complain about this wget escaping. Andries
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 07:39:40PM +0300, Eli Zaretskii wrote:

No. An exact copy allows me to decide what I have.

Which is the heuristic by which you want this to be solved. IMO, such a heuristic will not serve most of the users in most use cases. Users just want wget to DTRT automatically, and have the file names legible.

Let me see whether I understand you correctly. You want to do the right thing. You think that the right thing would be to invoke iconv. Since the original character set is unknown to user and wget, you have to guess. What could one guess? If the string is ASCII, fine. If the string is valid UTF-8, fine. If the user has specified the character set, fine. Otherwise? Leave it as it is?

Andries
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote: If we convert the file names using iconv, Windows users will also be happier, at least when the remote URL can be encoded in their system codepage. Windows does not differ from Unix - since the remote character set is unknown and not necessarily constant, a conversion is impossible. Windows does differ from Unix, in that arbitrary byte sequences cannot be used in file names. Of course. The code already tries to take care of that. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx for the gory details. Thanks for the reference! I already indicated the 1-line change that fixes the Windows problems. It doesn't, unfortunately. You are too brief. What is wrong with the change that changes /* insert some test for Windows */ into return true; ? That change only changes what wget does with bytes in the 128-159 range, and reading the gory details I fail to see any problem. Almost the opposite: Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255) At first sight, if there were a problem it would be because of the clause Any other character that the target file system does not allow. Thanks to your reference I now feel confident to make that 1-line change so that also Windows users are happy. Andries (There are restrictions involving filenames that wget perhaps does not enforce: no LPT3, no final space or period, ... It might be useful to teach wget about such details.)
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 19:51:58 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org

On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote: If we convert the file names using iconv, Windows users will also be happier, at least when the remote URL can be encoded in their system codepage.

Windows does not differ from Unix - since the remote character set is unknown and not necessarily constant, a conversion is impossible.

Windows does differ from Unix, in that arbitrary byte sequences cannot be used in file names.

Of course. The code already tries to take care of that.

It does that badly. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx for the gory details.

Thanks for the reference!

You are welcome.

I already indicated the 1-line change that fixes the Windows problems.

It doesn't, unfortunately.

You are too brief. What is wrong with the change that changes /* insert some test for Windows */ into return true; ?

It preserves the current behavior, whereby almost every non-ASCII URL out there gets saved in a file name that is either inaccessible to localized programs, or shows as illegible mojibake.

That change only changes what wget does with bytes in the 128-159 range, and reading the gory details I fail to see any problem. Almost the opposite: Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255)

You need to read between the lines, as it's Microsoft speak. First, not every codepoint between 128 and 255 is valid in every codepage. Second, Windows stores file names in UTF-16, so it attempts to convert the byte stream into UTF-16 assuming the byte stream is in the current codepage (which is incorrect in most cases, as we get UTF-8 instead). The result is an utmost mess.
Thanks to your reference I now feel confident to make that 1-line change so that also Windows users are happy.

Do you still think that? Then allow me a small demonstration:

D:\usr\eli\data>wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
--2015-08-18 21:23:38-- https://ru.wikipedia.org/wiki/%D7%80%C2%A1%D7%80%C2%B5%D7%81%E2%82%AC%D7%80%C2%B4%D7%81%E2%80%A0%D7%80%C2%B5
Loaded CA certificate 'd:/usr/etc/ssl/ca-bundle.crt'
Resolving ru.wikipedia.org (ru.wikipedia.org)... 91.198.174.192
Connecting to ru.wikipedia.org (ru.wikipedia.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-18 21:23:39 ERROR 404: Not Found.
--2015-08-18 21:23:39-- https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5
Reusing existing connection to ru.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
╫%80┬í╫%80┬╡╫%81Γ%8 [ = ] 180.32K 923KB/s in 0.2s
2015-08-18 21:23:40 (923 KB/s) - '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' saved [184652]

Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' is a good way to express 'Сердце'? Do you think someone will be able to read and understand such a file name? How would you go about converting it back to what it should be?

(There are restrictions involving filenames that wget perhaps does not enforce: no LPT3, no final space or period, ... It might be useful to teach wget about such details.)

Indeed. But that's a different issue, I think.
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 09:15:40PM +0300, Eli Zaretskii wrote:

Otherwise? Leave it as it is?

No, encode it as %XX hex escapes, thus making the file name pure ASCII. And have an option to leave it as is, so people who want that could have that.

OK, I can live with that.

On Tue, Aug 18, 2015 at 09:32:16PM +0300, Eli Zaretskii wrote:

Second, Windows stores file names in UTF-16, so it attempts to convert the byte stream into UTF-16 assuming the byte stream is in the current codepage (which is incorrect in most cases, as we get UTF-8 instead). The result is an utmost mess.

Yes, conversion always leads to problems. So, I see that you want to use iconv to convert UTF-8 to the current codepage, so that Windows can convert that to UTF-16 again. As stated several times already I have zero experience on Windows, but is it possible to let wget change its current codepage to Unicode so that the Windows conversion is close to the identity map? It seems silly to have a double conversion with data loss if just a format conversion would suffice.

Andries
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 21:32:16 +0300 From: Eli Zaretskii e...@gnu.org Cc: bug-wget@gnu.org --2015-08-18 21:23:39-- https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5 Reusing existing connection to ru.wikipedia.org:443. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' ╫%80┬í╫%80┬╡╫%81Γ%8 [ = ] 180.32K 923KB/s in 0.2s 2015-08-18 21:23:40 (923 KB/s) - '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' saved [184652] Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' is a good way to express 'Сердце'? Do you think someone will be able to read and understand such a file name? How would you go about converting it back to what it should be? And of course the file name that is written is yet a different mojibake: '׳%80ֲ¡׳%80ֲµ׳%81ג%82¬׳%80ֲ´׳%81ג%80 ׳%80ֲµ' (copied from the directory listing displayed by UTF-16 capable Emacs). Note that it has right-to-left characters in it (probably because my locale is for the Hebrew language), to make it even less legible due to display-time reordering per the Unicode UAX#9.
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 21:11:25 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org On Tue, Aug 18, 2015 at 09:15:40PM +0300, Eli Zaretskii wrote: Otherwise? Leave it as it is? No, encode it as %XX hex escapes, thus making the file name pure ASCII. And have an option to leave it as is, so people who want that could have that. OK, I can live with that. Great, I'm glad we've found an agreeable compromise. So, I see that you want to use iconv to convert UTF-8 to the current codepage, so that Windows can convert that to UTF-16 again. Yes. As stated several times already I have zero experience on Windows, but is it possible to let wget change its current codepage to Unicode so that the Windows conversion is close to the identity map? No, it's not possible. Windows does have a UTF-8 codepage, but it doesn't allow setting that as the system codepage. What is needed to have a full Unicode support in wget on Windows is to provide replacements for all the file-name related libc functions ('fopen', 'open', 'stat', 'access', etc.) which will accept file names encoded in UTF-8, convert them internally into UTF-16, and call the wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat', '_waccess', etc.) with the converted file name. Another thing that is needed is similar replacements for 'printf', 'puts', 'fprintf', etc. when they are used for writing file names to the console -- because we cannot write UTF-8 sequences to the Windows console. Doing this is not rocket science (I did something similar for Emacs last year), but more work than just a call to iconv that's needed on Unix.
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 10:31:31PM +0300, Eli Zaretskii wrote:

Is it possible to let wget change its current codepage to Unicode so that the Windows conversion is close to the identity map?

No, it's not possible. Windows does have a UTF-8 codepage, but it doesn't allow setting that as the system codepage. What is needed to have a full Unicode support in wget on Windows is to provide replacements for all the file-name related libc functions ('fopen', 'open', 'stat', 'access', etc.) which will accept file names encoded in UTF-8, convert them internally into UTF-16, and call the wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat', '_waccess', etc.) with the converted file name. Another thing that is needed is similar replacements for 'printf', 'puts', 'fprintf', etc. when they are used for writing file names to the console -- because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov. Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html There we had a similar discussion, and he wrote mswindows.diff with

+int
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Something similar is what would be needed on Windows? Is his patch usable? Maybe I also commented a little in http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html but after that nothing happened, it seems.

Andries
Re: [Bug-wget] bad filenames (again)
On 18/08/15 22:28, Andries E. Brouwer wrote:

On Tue, Aug 18, 2015 at 10:31:31PM +0300, Eli Zaretskii wrote: No, it's not possible. Windows does have a UTF-8 codepage, but it doesn't allow setting that as the system codepage. What is needed to have a full Unicode support in wget on Windows is to provide replacements for all the file-name related libc functions ('fopen', 'open', 'stat', 'access', etc.) which will accept file names encoded in UTF-8, convert them internally into UTF-16, and call the wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat', '_waccess', etc.) with the converted file name. Another thing that is needed is similar replacements for 'printf', 'puts', 'fprintf', etc. when they are used for writing file names to the console -- because we cannot write UTF-8 sequences to the Windows console.

Aha. That reminds me of a patch by I think Aleksey Bykov. Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html There we had a similar discussion, and he wrote mswindows.diff with

+int
+wc_utime (unsigned char *filename, struct _utimbuf *times)
+{
+  wchar_t *w_filename;
+  int buffer_size;
+
+  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
+  w_filename = alloca (buffer_size);
+  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
+  return _wutime (w_filename, times);
+}

and similar for stat, open, etc. Something similar is what would be needed on Windows? Is his patch usable? Maybe I also commented a little in http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html but after that nothing happened, it seems.

Andries

That would probably work, but would need a review. On a quick look, some of the functions have memory leaks (seems he first used malloc, then changed just some of them to alloca). And of course, there's the question of what to do if the filename we are trying to convert to utf-16 is not in fact valid utf-8.
Re: [Bug-wget] bad filenames (again)
On Wed, Aug 19, 2015 at 01:43:51AM +0200, Ángel González wrote: And of course, there's the question of what to do if the filename we are trying to convert to utf-16 is not in fact valid utf-8. My current understanding: (i) there is a current patch, that fixes most problems on Unix and can be applied today (ii) one also wants to fix Windows problems, and in the process do something more general for Unix. We can discuss a future patch that does something like:

Look at the remote filename. Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7f), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.

Determine a local character set as follows:
- if the user specified a to-charset, use that
- if the locale uses UTF-8, use that
- otherwise take ASCII

Convert the name from from-charset to to-charset:
- if the user asked for unmodified filenames, do nothing
- if the name is ASCII, do nothing
- if the name is UTF-8 and the locale uses UTF-8, do nothing
- convert from Unknown by hex-escaping the entire name
- convert to ASCII by hex-escaping the entire name
- otherwise invoke iconv(); upon failure, escape the illegal bytes

See whether the resulting name can be used. On Unix all strings (without NUL and '/') are ok. On Windows there are many restrictions. Further hex-escape problematic characters on Windows. Since conversions to 8-bit character sets will often fail, it is desirable to convince Windows to use Unicode as the current codeset. Maybe that requires a copy of the common fileio routines. That is my view of the result of the present conversation. Probably some refinements will be needed. Moreover, there is interference with the iri stuff that should be looked at. Once we know what we want it is trivial to write the code, but it may take a while to figure out what we want. I think we should start applying the current patch. Andries
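[Editorial note: the "hex-escape the entire name" fallback in the scheme above can be sketched as follows. This is a simplified illustration with a name of my choosing; wget's real quoting machinery lives in url.c and uses its own character-class tables.]

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative fallback: printable ASCII other than '%' passes through;
   everything else (including '%' itself, so decoding stays unambiguous)
   becomes %XX.  Caller frees the result. */
static char *
hex_escape_name (const unsigned char *s, size_t len)
{
  char *out = malloc (3 * len + 1);   /* worst case: every byte escaped */
  char *p = out;
  if (out == NULL)
    return NULL;
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = s[i];
      if (c >= 0x20 && c < 0x7F && c != '%')
        *p++ = (char) c;
      else
        p += sprintf (p, "%%%02X", c);
    }
  *p = '\0';
  return out;
}
```

So a name consisting of the bytes 96 61 62 25 would come out as %96ab%25, fully reversible by a %-decoder.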
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 12:55:50 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: bug-wget@gnu.org, Andries E. Brouwer andries.brou...@cwi.nl, Eli Zaretskii e...@gnu.org The point is: it is the user's choice to load a font. (Or to set a locale.) Most users never change a locale, unless they are trying something special, precisely because their file names will display as mojibake. So wget should IMO by default cater to this use case, and allow saving the bytes verbatim as an option. For historical reasons a single directory can have files with names in several character sets. Again, this is a rare situation. We shouldn't punish the majority on behalf of such rare use cases. All this is about the local situation. One cannot know the character set of a filename because that concept does not exist in Unix. Of course, it exists. The _filesystem_ doesn't know it, but users do. About the remote situation even less is known. Assuming UTF-8 will go a long way towards resolving this. When this is not so, we have the --remote-encoding switch. It would be terrible if wget decided to use obscure heuristics to invent a remote character set and then invoke iconv. But what you suggest instead -- create a file name whose bytes are an exact copy of the remote -- is just another heuristic. And the effects are no less terrible, because file names will become illegible, especially on systems where UTF-8 is not the locale's codeset. I'm okay with having an option to do that, but it shouldn't be the default, IMO.
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 17:28:34 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org About the remote situation even less is known. Assuming UTF-8 will go a long way towards resolving this. When this is not so, we have the --remote-encoding switch. This is wget. The user is recursively downloading a file hierarchy. Only after downloading does it become clear what one has got. In some use cases, yes. In most others, no: the encoding is known in advance. I download a collection of East Asian texts on some topic. Upon examination, part is in SJIS, part in Big5, part in EUC-JP, part in UTF-8. Since the downloaded stuff does not have a uniform character set, and surely the server is not going to specify character sets, any invocation of iconv will corrupt my data. When I get the unmodified data I look using browser or editor or xterm+luit for which character set setting I get readable text. I already said that wget should support this use case. I just don't think it should be the default. It would be terrible if wget decided to use obscure heuristics to invent a remote character set and then invoke iconv. But what you suggest instead -- create a file name whose bytes are an exact copy of the remote -- is just another heuristic. No. An exact copy allows me to decide what I have. Which is just the heuristic by which you want this solved. IMO, such a heuristic will not serve most of the users in most use cases. Users just want wget to DTRT automatically, and have the file names legible. Conversion leads to data loss. When it does, or there's a risk that it does, users should use optional features to countermand that.
Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 17:56:30 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org For example, nothing was changed yet for Windows, but also Windows users complain about this wget escaping. If we convert the file names using iconv, Windows users will also be happier, at least when the remote URL can be encoded in their system codepage. Windows does not differ from Unix - since the remote character set is unknown and not necessarily constant, a conversion is impossible. Windows does differ from Unix, in that arbitrary byte sequences cannot be used in file names. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx for the gory details. I already indicated the 1-line change that fixes the Windows problems. It doesn't, unfortunately.
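[Editorial note: for reference, the restrictions behind Eli's remark: the MSDN page cited above forbids the characters < > : " / \ | ? * and bytes 0x00-0x1F in Windows file names. A toy sanitizer follows; replacing with '_' is my illustrative choice here, whereas wget's actual behaviour is to restrict characters flagged as unusable on Windows in its url.c tables.]

```c
#include <assert.h>
#include <string.h>

/* Toy illustration: overwrite, in place, every character that the MSDN
   "Naming Files, Paths, and Namespaces" page forbids in file names. */
static void
sanitize_for_windows (char *name)
{
  for (char *p = name; *p; p++)
    if ((unsigned char) *p < 0x20 || strchr ("<>:\"/\\|?*", *p) != NULL)
      *p = '_';
}
```

For example, a remote name a:b*c?.txt would become a_b_c_.txt.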
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 05:45:13PM +0300, Eli Zaretskii wrote: All this is about the local situation. One cannot know the character set of a filename because that concept does not exist in Unix. Of course, it exists. The _filesystem_ doesn't know it, but users do. Usually, yes. About the remote situation even less is known. Assuming UTF-8 will go a long way towards resolving this. When this is not so, we have the --remote-encoding switch. This is wget. The user is recursively downloading a file hierarchy. Only after downloading does it become clear what one has got. I download a collection of East Asian texts on some topic. Upon examination, part is in SJIS, part in Big5, part in EUC-JP, part in UTF-8. Since the downloaded stuff does not have a uniform character set, and surely the server is not going to specify character sets, any invocation of iconv will corrupt my data. When I get the unmodified data I look using browser or editor or xterm+luit for which character set setting I get readable text. It would be terrible if wget decided to use obscure heuristics to invent a remote character set and then invoke iconv. But what you suggest instead -- create a file name whose bytes are an exact copy of the remote -- is just another heuristic. No. An exact copy allows me to decide what I have. Conversion leads to data loss. Andries
Re: [Bug-wget] bad filenames (again)
On Tue, Aug 18, 2015 at 06:22:41PM +0300, Eli Zaretskii wrote: It is debatable what precisely would be the right thing, but my patch greatly increases the number of happy users. AFAIU, it does that only when the target locale is UTF-8. By using iconv we can make wget DTRT in more locales. No, because wget, and the invoker of wget, does not know the remote character set. And there need not be one. A Chinese site often has a mixture of material in Traditional Chinese and Simplified Chinese. Any conversion would just make the stuff unreadable. For example, nothing was changed yet for Windows, but also Windows users complain about this wget escaping. If we convert the file names using iconv, Windows users will also be happier, at least when the remote URL can be encoded in their system codepage. Windows does not differ from Unix - since the remote character set is unknown and not necessarily constant, a conversion is impossible. I already indicated the 1-line change that fixes the Windows problems. Andries
Re: [Bug-wget] bad filenames (again)
Date: Mon, 17 Aug 2015 22:51:12 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote: what do we want to achieve here, and why is what wget did before your patch the wrong thing? Wget modified filenames, and users are unhappy. See

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
http://savannah.gnu.org/bugs/?37564
http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
http://stackoverflow.com/questions/27054765/wget-japanese-characters
http://www.win.tue.nl/~aeb/linux/misc/wget.html

etc. There's no argument that wget currently doesn't cope well with these cases. The issue being discussed is what should it do instead. It is debatable what precisely would be the right thing, but my patch greatly increases the number of happy users. AFAIU, it does that only when the target locale is UTF-8. By using iconv we can make wget DTRT in more locales. For example, nothing was changed yet for Windows, but also Windows users complain about this wget escaping. If we convert the file names using iconv, Windows users will also be happier, at least when the remote URL can be encoded in their system codepage. (To support characters outside of the system codepage, deeper changes are needed in the Windows build of wget, for the reasons I explained elsewhere in this thread.)
Re: [Bug-wget] bad filenames (again)
On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote: (i) [about using setlocale] First, relying on UTF-8 locale to be announced in the environment is less portable than it could be: it's better to call 'setlocale' Then ... at least Cygwin will not be excluded from this feature. I left the wget behaviour for MSDOS / Windows / Cygwin unchanged because I do not know anything about these platforms. These systems don't normally have the LC_* environment variables, and their 'setlocale' (with the exception of Cygwin) does not look at those variables. But you _can_ obtain the current locale on all supported systems by calling 'setlocale'. Good. Then perhaps using setlocale would be better. I will not do so - I do not feel confident on the Windows platform. After all, the goal is not to find out what locale we are in, but to find out whether it might be a good idea to escape certain bytes in a filename. The original author's code was based on the idea that the system is using an ISO-8859-n character set. On Windows I guess that FAT filesystems will use some code page, and NTFS filesystems will use Unicode. If that is correct, then perhaps it never makes sense to do this escaping of high control bytes on a Windows system. [So, I conjecture that we could make Windows users happy by replacing /* insert some test for Windows */ by return true; (and updating the function name).] (ii) [about possibly using iconv] How do you guess the original character set? Since you pass silently over this point, it seems there is no good way to involve iconv. This is a philosophical question: is a Cyrillic file name encoded in koi8-r and the same name encoded in UTF-8 modified data, or the same data expressed in different codesets? Unix filenames are not necessarily in any particular character set. They are sequences of bytes different from NUL and '/'. A different sequence of bytes is a different filename. Also, the same name encoded in UTF-8 is an optimistic description. 
Should the Unicode be NFC? Or NFD? MacOS has a third version. Even if the filename had a well-defined and known character set, conversion to UTF-8 is not uniquely defined. So, it seems to me that one cannot use iconv unless --remote-encoding and --local-encoding have been specified by the user. And if that is the case, then perhaps iconv is already invoked (in the iri code; I have not checked the details). Andries
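[Editorial note: Eli's suggested alternative to the getenv()-based test -- ask the C library itself -- can be sketched as below. The function names are mine. nl_langinfo is POSIX; on native Windows it does not exist and one would need a replacement such as the Gnulib one mentioned later in the thread.]

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Does a codeset name denote UTF-8?  Matches the spellings the original
   getenv-based patch looked for. */
static bool
codeset_is_utf8 (const char *cs)
{
  return cs != NULL
         && (strcmp (cs, "UTF-8") == 0 || strcmp (cs, "UTF8") == 0
             || strcmp (cs, "utf-8") == 0 || strcmp (cs, "utf8") == 0);
}

/* Adopt the environment's locale, then ask the C library for the codeset
   instead of parsing LC_ALL / LC_CTYPE / LANG by hand. */
static bool
locale_is_utf8 (void)
{
  setlocale (LC_CTYPE, "");
  return codeset_is_utf8 (nl_langinfo (CODESET));
}
```

Note that this also sidesteps the crash lurking in the getenv-based version when none of LC_ALL, LC_CTYPE and LANG is set.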
Re: [Bug-wget] bad filenames (again)
On Thursday 13 August 2015 19:10:41 Andries E. Brouwer wrote: On Thu, Aug 13, 2015 at 05:54:57PM +0200, Tim Ruehsen wrote: I just made up a test case, but can't apply your patch. Please rebase to latest git master and generate your patch with git format-patch and send it as attachment. Thanks. OK, see attached. Andries Based on that, and your proposal about the progress bar, I made up a bunch of patches. The new test case is not yet ready. @Andries: Maybe you can put a few more test cases into that (or send me a few examples that should work). I also would like to see broken UTF-8 sequences in this test. @Darshit Could you have a closer look into the patches, please? Neither python nor the progress code is my playground... you are the expert here. Tim

From 1ae1aeda78d83e570fe7ee5881c7e9caf182e991 Mon Sep 17 00:00:00 2001
From: Andries E. Brouwer a...@cwi.nl
Date: Thu, 13 Aug 2015 19:06:03 +0200
Subject: [PATCH 1/4] Do not escape high control bytes on a UTF-8 system.

---
 src/init.c    | 26 +-
 src/options.h |  1 +
 src/url.c     | 12 +---
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/src/init.c b/src/init.c
index ea074cc..6f71de1 100644
--- a/src/init.c
+++ b/src/init.c
@@ -348,6 +348,27 @@ command_by_name (const char *cmdname)
   return -1;
 }

+
+/* Used to determine whether bytes 128-159 are OK in a filename */
+static int
+have_utf8_locale() {
+#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
+  /* insert some test for Windows */
+#else
+  char *p;
+
+  p = getenv ("LC_ALL");
+  if (p == NULL)
+    p = getenv ("LC_CTYPE");
+  if (p == NULL)
+    p = getenv ("LANG");
+  if (strstr (p, "UTF-8") != NULL || strstr (p, "UTF8") != NULL ||
+      strstr (p, "utf-8") != NULL || strstr (p, "utf8") != NULL)
+    return true;
+#endif
+  return false;
+}
+
 /* Reset the variables to default values. */
 void
 defaults (void)
@@ -419,6 +440,7 @@ defaults (void)
   opt.restrict_files_os = restrict_unix;
 #endif
   opt.restrict_files_ctrl = true;
+  opt.restrict_files_highctrl = (have_utf8_locale() ? false : true);
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;

@@ -1487,6 +1509,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
 {
   int restrict_os = opt.restrict_files_os;
   int restrict_ctrl = opt.restrict_files_ctrl;
+  int restrict_highctrl = opt.restrict_files_highctrl;
   int restrict_case = opt.restrict_files_case;
   int restrict_nonascii = opt.restrict_files_nonascii;

@@ -1511,7 +1534,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
       else if (VAL_IS ("uppercase"))
         restrict_case = restrict_uppercase;
       else if (VAL_IS ("nocontrol"))
-        restrict_ctrl = false;
+        restrict_ctrl = restrict_highctrl = false;
       else if (VAL_IS ("ascii"))
         restrict_nonascii = true;
       else

@@ -1532,6 +1555,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno
   opt.restrict_files_os = restrict_os;
   opt.restrict_files_ctrl = restrict_ctrl;
+  opt.restrict_files_highctrl = restrict_highctrl;
   opt.restrict_files_case = restrict_case;
   opt.restrict_files_nonascii = restrict_nonascii;

diff --git a/src/options.h b/src/options.h
index 24ddbb5..083d16b 100644
--- a/src/options.h
+++ b/src/options.h
@@ -251,6 +251,7 @@ struct options
   bool restrict_files_ctrl;     /* non-zero if control chars in URLs are
                                    restricted from appearing in
                                    generated file names. */
+  bool restrict_files_highctrl; /* idem for bytes 128-159 */
   bool restrict_files_nonascii; /* non-zero if bytes with values greater
                                    than 127 are restricted. */
   enum {

diff --git a/src/url.c b/src/url.c
index 73c8dd0..e98bfaa 100644
--- a/src/url.c
+++ b/src/url.c
@@ -1348,7 +1348,8 @@ enum {
   filechr_not_unix    = 1,  /* unusable on Unix, / and \0 */
   filechr_not_vms     = 2,  /* unusable on VMS (ODS5), 0x00-0x1F * ? */
   filechr_not_windows = 4,  /* unusable on Windows, one of \|/?:* */
-  filechr_control     = 8   /* a control character, e.g. 0-31 */
+  filechr_control     = 8,  /* a control character, e.g. 0-31 */
+  filechr_highcontrol = 16  /* a high control character, in 128-159 */
 };

 #define FILE_CHAR_TEST(c, mask) \
@@ -1360,6 +1361,7 @@ enum {
 #define V filechr_not_vms
 #define W filechr_not_windows
 #define C filechr_control
+#define Z filechr_highcontrol

 #define UVWC U|V|W|C
 #define UW U|W
@@ -1392,8 +1394,8 @@
 UVWC,  VC,  VC,  VC,  VC,  VC,  VC,  VC,  /* NUL SOH STX ETX EOT ENQ ACK BEL */
   0,   0,   0,   0,   0,   0,   0,   0,  /* p   q   r   s   t   u   v   w   */
   0,   0,   0,   0,   W,   0,   0,   C,  /* x   y   z   {   |   }   ~   DEL */
-  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  /* 128-143 */
-  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  C,  /*
Re: [Bug-wget] bad filenames (again)
On Mon, Aug 17, 2015 at 01:17:06PM +0200, Tim Ruehsen wrote: @Andries: Maybe you can put a few more test cases into that (or send me a few examples that should work). I also would like to see broken UTF-8 sequences in this test. By some coincidence Noël Köthe just sent a bug report that provides one more test case. Fetch http://zh.wikipedia.org/wiki/%E9%A6%96%E9%A1%B5. One hopes to get a file with file name 首页, that is, with bytes e9 a6 96 e9 a1 b5, and that is what the patched wget gives. The unpatched wget produces an (unpronounceable) name with bytes e9 a6 25 39 36 e9 a1 b5 (because the byte 96 was escaped into %96). Andries [Here it is clear what one wants. In examples with broken UTF-8 sequences, something will happen as a result of the present code. It is unclear whether we want that or not. Changing the filename is bad, but illegal UTF-8 is also bad. Today I prefer the unchanged filename, but see no need for a test that checks that we really get that.]
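[Editorial note: the test case can be reproduced in miniature. Decoding the %-escapes of the URL path must yield exactly the six bytes e9 a6 96 e9 a1 b5, with no byte singled out for re-escaping. A sketch of such a decoder follows -- not the decoder wget itself uses.]

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Decode %XX escapes in a URL path component into raw bytes.
   Returns the number of bytes written; out must have room for
   strlen(in) + 1 bytes in the worst case. */
static size_t
percent_decode (const char *in, unsigned char *out)
{
  size_t n = 0;
  while (*in)
    {
      if (in[0] == '%' && isxdigit ((unsigned char) in[1])
          && isxdigit ((unsigned char) in[2]))
        {
          char hex[3] = { in[1], in[2], '\0' };
          out[n++] = (unsigned char) strtol (hex, NULL, 16);
          in += 3;
        }
      else
        out[n++] = (unsigned char) *in++;   /* literal byte */
    }
  out[n] = '\0';
  return n;
}
```

Feeding it %E9%A6%96%E9%A1%B5 gives back the six bytes of 首页; re-escaping byte 0x96 afterwards is exactly what turns the name into the e9 a6 25 39 36 e9 a1 b5 mess above.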
Re: [Bug-wget] bad filenames (again)
On Mon, Aug 17, 2015 at 06:27:05PM +0300, Eli Zaretskii wrote: (ii) [about possibly using iconv] How do you guess the original character set? The answer is call nl_langinfo (CODESET). I think we are not communicating. wget fetches a file from a remote machine. We know the filename (as a sequence of bytes). As far as I can see, there is no information on what character set (if any) that sequence of bytes might be in. In order to call iconv, I need a from-charset and a to-charset. I think your answer tells me how to find a reasonable to-charset. But the problem is how to find a from-charset. [Even when from-charset and to-charset are known there is a can of worms involved in conversion. But without from-charset one cannot even start thinking about conversion.] Unix filenames are not necessarily in any particular character set. They are sequences of bytes different from NUL and '/'. A different sequence of bytes is a different filename. As long as you treat them as UTF-8 encoded strings, ... I don't understand how one can treat sequences of bytes that are not valid UTF-8 as UTF-8 encoded strings. If all the world is UTF-8 then fine. But the remote machine is an unknown system. We just have a byte sequence, that is all. Andries
Re: [Bug-wget] bad filenames (again)
Date: Mon, 17 Aug 2015 12:59:05 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote: (i) [about using setlocale] First, relying on UTF-8 locale to be announced in the environment is less portable than it could be: it's better to call 'setlocale' Then ... at least Cygwin will not be excluded from this feature. I left the wget behaviour for MSDOS / Windows / Cygwin unchanged because I do not know anything about these platforms. These systems don't normally have the LC_* environment variables, and their 'setlocale' (with the exception of Cygwin) does not look at those variables. But you _can_ obtain the current locale on all supported systems by calling 'setlocale'. Good. Then perhaps using setlocale would be better. I will not do so - do not feel confident on the Windows platform. You don't need to -- do it on your OS, and the same will work elsewhere. After all, the goal is not to find out what locale we are in, but to find out whether it might be a good idea to escape certain bytes in a filename. Indeed, you want the current locale's codeset, see below. On Windows I guess that FAT filesystems will use some code page, and NTFS filesystems will use Unicode. Not exactly. The functions that emulate Posix and accept file names as char * strings cannot use Unicode on Windows, because using Unicode means using wchar_t strings instead. So, unless Someone™ changes wget to do that, at least on Windows, the Windows port will still use the current system codepage, even on NTFS, because that's what functions like 'fopen', 'open', 'stat', etc. assume. (ii) [about possibly using iconv] How do you guess the original character set? Since you pass silently over this point No, I just missed that, sorry. The answer is call nl_langinfo (CODESET). 
Windows doesn't have 'nl_langinfo', but it is easily emulated with more or less a single API call, or we could use the Gnulib replacement (which already does support Windows). it seems there is no good way to involve iconv. Actually, there's no problem, see above. Many programs do it like that already. This is a philosophical question: is a Cyrillic file name encoded in koi8-r and the same name encoded in UTF-8 a modified data or the same data expressed in different codesets. Unix filenames are not necessarily in any particular character set. They are sequences of bytes different from NUL and '/'. A different sequence of bytes is a different filename. As long as you treat them as UTF-8 encoded strings, they are, for all practical purposes, in the Unicode character set. (Which, btw, brings up the question what to do if the UTF-8 sequence is for u+FFFD or is simply invalid -- do we treat them as control characters or don't we?) Also, the same name encoded in UTF-8 is an optimistic description. Should the Unicode be NFC? Or NFD? MacOS has a third version. It doesn't matter, since any filesystem worth its sectors will DTRT and any ls-like program will, too, and will show you a perfectly legible file name. Even if the filename had a well-defined and known character set, conversion to UTF-8 is not uniquely defined. Do whatever iconv does, and we will be fine.
Re: [Bug-wget] bad filenames (again)
Date: Mon, 17 Aug 2015 19:58:31 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org On Mon, Aug 17, 2015 at 06:27:05PM +0300, Eli Zaretskii wrote: (ii) [about possibly using iconv] How do you guess the original character set? The answer is call nl_langinfo (CODESET). I think we are not communicating. wget fetches a file from a remote machine. We know the filename (as a sequence of bytes). As far as I can see, there is no information on what character set (if any) that sequence of bytes might be in. Then please explain why you started this thread by saying that the byte sequence should end up unaltered in the filesystem (and wrote the patch to do the same, AFAIU) if the target's locale uses UTF-8 as its encoding. What do you expect the file names to look like in 'ls' or anything similar, after doing that? In order to call iconv, I need a from-charset and a to-charset. I think your answer tells me how to find a reasonable to-charset. But the problem is how to find a from-charset. I thought the from-charset was UTF-8, or at least you assumed that. If it isn't, I see even less sense in the idea of your patch, which is basically writing the bytes unaltered. Don't we want to try to have on the target the same file names as on the source? If not, what do we want to achieve here, and why is what wget did before your patch the wrong thing? [Even when from-charset and to-charset are known there is a can of worms involved in conversion. No can of worms that I could see. Either the conversion succeeds, or it fails. You get a clear indication from iconv about that. Unix filenames are not necessarily in any particular character set. They are sequences of bytes different from NUL and '/'. A different sequence of bytes is a different filename. As long as you treat them as UTF-8 encoded strings, ... 
I don't understand how one can treat sequences of bytes that are not valid UTF-8 as UTF-8 encoded strings. If all the world is UTF-8 then fine. But the remote machine is an unknown system. We just have a byte sequence, that is all. If we know nothing about the source encoding, then the only sane thing is to always hex-encode characters with 8th bit set. But that's not what your patch does. It writes the byte stream verbatim to the filesystem if the target locale uses UTF-8 as its codeset. Please explain the logic behind this, because I don't see it.
Re: [Bug-wget] bad filenames (again)
Date: Thu, 13 Aug 2015 19:10:41 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: bug-wget@gnu.org, Andries E. Brouwer andries.brou...@cwi.nl

+/* Used to determine whether bytes 128-159 are OK in a filename */
+static int
+have_utf8_locale() {
+#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
+  /* insert some test for Windows */
+#else
+  char *p;
+
+  p = getenv ("LC_ALL");
+  if (p == NULL)
+    p = getenv ("LC_CTYPE");
+  if (p == NULL)
+    p = getenv ("LANG");
+  if (strstr (p, "UTF-8") != NULL || strstr (p, "UTF8") != NULL ||
+      strstr (p, "utf-8") != NULL || strstr (p, "utf8") != NULL)
+    return true;
+#endif
+  return false;
+}

[...]

+  opt.restrict_files_highctrl = (have_utf8_locale() ? false : true);

I'm not sure this is the right way to fix this. First, relying on UTF-8 locale to be announced in the environment is less portable than it could be: it's better to call 'setlocale' with the 2nd argument NULL to glean the same information. Then the ugly #ifdef above could be dropped, and at least Cygwin will not be excluded from this feature. Moreover, even if the locale is not UTF-8, wget should attempt to convert the file names to the current locale using iconv (which I believe was what Tim suggested). This will DTRT in much more cases than the above UTF-8 centric approach, IMO. Thanks.
Re: [Bug-wget] bad filenames (again)
Date: Sun, 16 Aug 2015 22:21:20 +0200 From: Andries E. Brouwer andries.brou...@cwi.nl Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de, bug-wget@gnu.org On Sun, Aug 16, 2015 at 05:43:50PM +0300, Eli Zaretskii wrote: (i) #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__) /* insert some test for Windows */ #else ... code that uses getenv to test LC_ALL, LC_CTYPE, LANG ... #endif I'm not sure this is the right way to fix this. First, relying on UTF-8 locale to be announced in the environment is less portable than it could be: it's better to call 'setlocale' with the 2nd argument NULL to glean the same information. Then the ugly #ifdef above could be dropped, and at least Cygwin will not be excluded from this feature. I left the wget behaviour for MSDOS / Windows / Cygwin unchanged because I do not know anything about these platforms. It is quite possible that the #ifdef is unneeded. Are you saying that it in fact is needed when getenv() is used, but unneeded when setlocale() is used? Yes. These systems don't normally have the LC_* environment variables, and their 'setlocale' (with the exception of Cygwin) does not look at those variables. But you _can_ obtain the current locale on all supported systems by calling 'setlocale'. And then what about LANG? What about it? You can test it in the environment, if you want, but IMO it's unnecessary, since either 'setlocale' already does, or the variable is not relevant to the issue at hand. (You need the codeset, not the language.) Moreover, even if the locale is not UTF-8, wget should attempt to convert the file names to the current locale using iconv (which I believe was what Tim suggested). This will DTRT in much more cases than the above UTF-8 centric approach, IMO. Hmm. My own point of view is almost the opposite. In my life I have spent countless hours trying to repair the damage done by software that helpfully modified my data. I prefer my data as-is, unless I explicitly ask for conversion. 
This is a philosophical question: is a Cyrillic file name encoded in koi8-r and the same name encoded in UTF-8 a modified data or the same data expressed in different codesets. Converting encoding as required by the locale is the expected behavior. Windows, for example, does that automatically (if possible). The patch enlarges the number of cases where the original data is preserved. Yes, I am all in favour of enlarging that number of cases even further. This is only a first step. But in my eyes applying iconv would be a step back. It can be really tricky to decode the mojibake obtained by converting A to C, while the original really was in B. If iconv succeeds to convert, you won't see any mojibake to begin with. If it fails, then yes, the conversion should be abandoned. What should happen when iconv() returns EILSEQ? Turn on the restrict_files_highctrl option, like you do now.
Re: [Bug-wget] bad filenames (again)
On Sun, Aug 16, 2015 at 05:43:50PM +0300, Eli Zaretskii wrote: (i) #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__) /* insert some test for Windows */ #else ... code that uses getenv to test LC_ALL, LC_CTYPE, LANG ... #endif I'm not sure this is the right way to fix this. First, relying on UTF-8 locale to be announced in the environment is less portable than it could be: it's better to call 'setlocale' with the 2nd argument NULL to glean the same information. Then the ugly #ifdef above could be dropped, and at least Cygwin will not be excluded from this feature. I left the wget behaviour for MSDOS / Windows / Cygwin unchanged because I do not know anything about these platforms. It is quite possible that the #ifdef is unneeded. Are you saying that it in fact is needed when getenv() is used, but unneeded when setlocale() is used? And then what about LANG? (ii) Moreover, even if the locale is not UTF-8, wget should attempt to convert the file names to the current locale using iconv (which I believe was what Tim suggested). This will DTRT in much more cases than the above UTF-8 centric approach, IMO. Hmm. My own point of view is almost the opposite. In my life I have spent countless hours trying to repair the damage done by software that helpfully modified my data. I prefer my data as-is, unless I explicitly ask for conversion. I think Tim suggested something else (namely, just checking whether the filename was valid UTF-8), but never mind. The patch enlarges the number of cases where the original data is preserved. Yes, I am all in favour of enlarging that number of cases even further. This is only a first step. But in my eyes applying iconv would be a step back. It can be really tricky to decode the mojibake obtained by converting A to C, while the original really was in B. How do you guess the original character set? What should happen when iconv() returns EILSEQ? Andries
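[Editorial note: Andries' two closing questions can be made concrete with iconv itself. The from-charset must be supplied by the caller -- that is exactly the unanswered part -- and an illegal input byte shows up as iconv() returning (size_t) -1 (errno EILSEQ), at which point the caller can fall back to keeping or escaping the original bytes. A hedged sketch; the function name is mine.]

```c
#include <assert.h>
#include <iconv.h>
#include <stdbool.h>
#include <string.h>

/* Try to convert inlen bytes from one charset to another.  Returns false
   on any failure (unsupported pair, EILSEQ on an illegal byte, EINVAL on
   a truncated sequence, E2BIG on overflow), so the caller can keep the
   original name.  The caller must leave one spare byte in out for the
   terminating NUL, i.e. pass outlen = sizeof buffer - 1. */
static bool
try_convert (const char *from, const char *to,
             char *in, size_t inlen, char *out, size_t outlen)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return false;                  /* conversion pair not supported */

  char *inp = in, *outp = out;
  size_t rc = iconv (cd, &inp, &inlen, &outp, &outlen);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return false;                  /* e.g. EILSEQ: keep the original bytes */
  *outp = '\0';
  return true;
}
```

For instance, the KOI8-R byte 0xC1 (Cyrillic lowercase a) converts cleanly to the UTF-8 pair d0 b0, while feeding the byte 0xFF as "UTF-8" input fails, which is the EILSEQ case Andries asks about.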
Re: [Bug-wget] bad filenames (again)
I guess this issue is now closed? We should document libgpgme11-dev as a dependency. On Fri, Aug 14, 2015 at 1:38 AM, Tim Rühsen tim.rueh...@gmx.de wrote: On Thursday, 13 August 2015, 19:33:56, Andries E. Brouwer wrote: After git clone, one gets a wget tree without autogenerated files. README.checkout tells one to run ./bootstrap to generate configure. But:

$ ./bootstrap
./bootstrap: Bootstrapping from checked-out wget sources...
./bootstrap: consider installing git-merge-changelog from gnulib
./bootstrap: getting gnulib files...
...
running: AUTOPOINT=true LIBTOOLIZE=true autoreconf --verbose --install --force -I m4 --no-recursive
autoreconf: Entering directory `.'
autoreconf: running: true --force
autoreconf: running: aclocal -I m4 --force -I m4
configure.ac:498: warning: macro 'AM_PATH_GPGME' not found in library
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --include=m4 --force
configure.ac:93: error: possibly undefined macro: AC_DEFINE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:498: error: possibly undefined macro: AM_PATH_GPGME
autoreconf: /usr/bin/autoconf failed with exit status: 1
./bootstrap: autoreconf failed

Yes, sorry, that is a recent issue with metalink. Darshit works on that. You have to install libgpgme11-dev (or a similarly named package). Tim -- Thanking You, Darshit Shah

From b495c71adc88642d06f141c612f82ba10bdb7ee1 Mon Sep 17 00:00:00 2001
From: Darshit Shah dar...@gmail.com
Date: Sat, 15 Aug 2015 12:22:33 +0530
Subject: [PATCH] Document dependency on libgpgme11-dev

* README.checkout: Document dependency on libgpgme11-dev required by the metalink code.
---
 README.checkout | 5 +
 1 file changed, 5 insertions(+)

diff --git a/README.checkout b/README.checkout
index 03463d1..eff6abc 100644
--- a/README.checkout
+++ b/README.checkout
@@ -94,6 +94,10 @@ Compiling From Repository Sources
   saved the .pc file. Example:

   $ PKG_CONFIG_PATH=. ./configure

+* [46]libgpgme11-dev is required to compile with support for metalink files
+  and GPGME support. Metalink requires this library to verify the integrity
+  of the download.
+
 For those who might be confused as to what to do once they check out the
 source code, considering configure and Makefile do not yet exist at
@@ -200,3 +204,4 @@ References
   43. http://validator.w3.org/check?uri=referer
   44. http://wget.addictivecode.org/WikiLicense
   45. https://www.python.org/
+  46. https://www.gnupg.org/%28it%29/related_software/gpgme/index.html
--
2.5.0
Re: [Bug-wget] bad filenames (again)
On Thu, Aug 13, 2015 at 05:54:57PM +0200, Tim Ruehsen wrote: I just made up a test case, but can't apply your patch. Please rebase to latest git master, generate your patch with git format-patch, and send it as an attachment. Thanks. OK, see attached. Andries From 5980a3665d8924c7d2374f0740bb82ff0cdc9043 Mon Sep 17 00:00:00 2001 From: Andries E. Brouwer a...@cwi.nl Date: Thu, 13 Aug 2015 19:06:03 +0200 Subject: [PATCH] Do not escape high control bytes on a UTF-8 system. --- src/init.c | 26 +- src/options.h | 1 + src/url.c | 12 +--- 3 files changed, 35 insertions(+), 4 deletions(-) diff --git a/src/init.c b/src/init.c index ea074cc..6f71de1 100644 --- a/src/init.c +++ b/src/init.c @@ -348,6 +348,27 @@ command_by_name (const char *cmdname) return -1; } + +/* Used to determine whether bytes 128-159 are OK in a filename */ +static int +have_utf8_locale() { +#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__) + /* insert some test for Windows */ +#else + char *p; + + p = getenv("LC_ALL"); + if (p == NULL) +p = getenv("LC_CTYPE"); + if (p == NULL) +p = getenv("LANG"); + if (strstr(p, "UTF-8") != NULL || strstr(p, "UTF8") != NULL || + strstr(p, "utf-8") != NULL || strstr(p, "utf8") != NULL) +return true; +#endif + return false; +} + /* Reset the variables to default values. */ void defaults (void) @@ -419,6 +440,7 @@ defaults (void) opt.restrict_files_os = restrict_unix; #endif opt.restrict_files_ctrl = true; + opt.restrict_files_highctrl = (have_utf8_locale() ? 
false : true); opt.restrict_files_nonascii = false; opt.restrict_files_case = restrict_no_case_restriction; @@ -1487,6 +1509,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno { int restrict_os = opt.restrict_files_os; int restrict_ctrl = opt.restrict_files_ctrl; + int restrict_highctrl = opt.restrict_files_highctrl; int restrict_case = opt.restrict_files_case; int restrict_nonascii = opt.restrict_files_nonascii; @@ -1511,7 +1534,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno else if (VAL_IS (uppercase)) restrict_case = restrict_uppercase; else if (VAL_IS (nocontrol)) -restrict_ctrl = false; +restrict_ctrl = restrict_highctrl = false; else if (VAL_IS (ascii)) restrict_nonascii = true; else @@ -1532,6 +1555,7 @@ cmd_spec_restrict_file_names (const char *com, const char *val, void *place_igno opt.restrict_files_os = restrict_os; opt.restrict_files_ctrl = restrict_ctrl; + opt.restrict_files_highctrl = restrict_highctrl; opt.restrict_files_case = restrict_case; opt.restrict_files_nonascii = restrict_nonascii; diff --git a/src/options.h b/src/options.h index 24ddbb5..083d16b 100644 --- a/src/options.h +++ b/src/options.h @@ -251,6 +251,7 @@ struct options bool restrict_files_ctrl; /* non-zero if control chars in URLs are restricted from appearing in generated file names. */ + bool restrict_files_highctrl; /* idem for bytes 128-159 */ bool restrict_files_nonascii; /* non-zero if bytes with values greater than 127 are restricted. */ enum { diff --git a/src/url.c b/src/url.c index 73c8dd0..e98bfaa 100644 --- a/src/url.c +++ b/src/url.c @@ -1348,7 +1348,8 @@ enum { filechr_not_unix= 1, /* unusable on Unix, / and \0 */ filechr_not_vms = 2, /* unusable on VMS (ODS5), 0x00-0x1F * ? */ filechr_not_windows = 4, /* unusable on Windows, one of \|/?:* */ - filechr_control = 8 /* a control character, e.g. 0-31 */ + filechr_control = 8, /* a control character, e.g. 
0-31 */ + filechr_highcontrol = 16 /* a high control character, in 128-159 */ }; #define FILE_CHAR_TEST(c, mask) \ @@ -1360,6 +1361,7 @@ enum { #define U filechr_not_unix #define V filechr_not_vms #define W filechr_not_windows #define C filechr_control +#define Z filechr_highcontrol #define UVWC U|V|W|C #define UW U|W @@ -1392,8 +1394,8 @@ UVWC, VC, VC, VC, VC, VC, VC, VC, /* NUL SOH STX ETX EOT ENQ ACK BEL */ 0, 0, 0, 0, 0, 0, 0, 0, /* p q r s t u v w */ 0, 0, 0, 0, W, 0, 0, C, /* x y z { | } ~ DEL */ - C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, /* 128-143 */ - C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, /* 144-159 */ + Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, /* 128-143 */ + Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, /* 144-159 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, @@ -1406,6 +1408,7 @@ UVWC, VC, VC, VC, VC, VC, VC, VC, /* NUL SOH STX ETX EOT ENQ ACK BEL */ #undef U #undef V #undef W #undef C +#undef Z #undef UW #undef UVWC #undef VC @@ -1448,8 +1451,11 @@ append_uri_pathel (const char *b, const char *e, bool
Re: [Bug-wget] bad filenames (again)
On Thursday, 13 August 2015, 19:33:56, Andries E. Brouwer wrote: After git clone, one gets a wget tree without autogenerated files. README.checkout tells one to run ./bootstrap to generate configure. But: $ ./bootstrap ./bootstrap: Bootstrapping from checked-out wget sources... ./bootstrap: consider installing git-merge-changelog from gnulib ./bootstrap: getting gnulib files... ... running: AUTOPOINT=true LIBTOOLIZE=true autoreconf --verbose --install --force -I m4 --no-recursive autoreconf: Entering directory `.' autoreconf: running: true --force autoreconf: running: aclocal -I m4 --force -I m4 configure.ac:498: warning: macro 'AM_PATH_GPGME' not found in library autoreconf: configure.ac: tracing autoreconf: configure.ac: not using Libtool autoreconf: running: /usr/bin/autoconf --include=m4 --force configure.ac:93: error: possibly undefined macro: AC_DEFINE If this token and others are legitimate, please use m4_pattern_allow. See the Autoconf documentation. configure.ac:498: error: possibly undefined macro: AM_PATH_GPGME autoreconf: /usr/bin/autoconf failed with exit status: 1 ./bootstrap: autoreconf failed Yes sorry, that is a recent issue with metalink. Darshit is working on that. You have to install libgpgme11-dev (or a similarly named package). Tim
Re: [Bug-wget] bad filenames (again)
Hi Andries, I just made up a test case, but can't apply your patch. Please rebase to latest git master, generate your patch with git format-patch, and send it as an attachment. Thanks. Regards, Tim On Wednesday 12 August 2015 19:36:52 Andries E. Brouwer wrote: On Wed, Aug 12, 2015 at 05:54:25PM +0200, Tim Ruehsen wrote: OK. Let's set up a test where we define input and expected output. If that works, I am fine. OK. I mentioned a Hebrew example, but in order to avoid the additional difficulty of bidi text, let me find a Russian example instead. % wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5 Saving to: ‘Се\321%80д\321%86е’ % my_wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5 Saving to: ‘Сердце’ (This is the Russian Wikipedia page for 'heart'). Andries --- BTW - now that I tried this: the progress bar contains an ugly symbol. Looking at progress.c I see int padding = MAX_FILENAME_COLS - orig_filename_cols; sprintf (p, "%s ", bp->f_download); p += orig_filename_cols + 1; for (;padding;padding--) *p++ = ' '; but orig_filename_cols was computed correctly, counting character positions, not bytes, so the p += orig_filename_cols + 1; is a bug. The ugly symbol appears because a multibyte character was truncated. If I write sprintf (p, "%s ", bp->f_download); p += strlen(bp->f_download) + 1; while (p < bp->buffer + MAX_FILENAME_COLS) *p++ = ' '; instead, then the progress bar text looks right in this particular case. I have not yet read the surrounding code.
Re: [Bug-wget] bad filenames (again)
On Wednesday 12 August 2015 14:38:15 Andries E. Brouwer wrote: Hi Tim, Just a few questions. 1. Why don't you use 'opt.locale' to check if the local encoding is UTF-8? I thought that was usable only if ENABLE_IRI was defined. I see. ENABLE_IRI, libiconv (for conversion) and libidn (used for setting opt.locale) are tightly coupled. Understandable that you won't go into that swamp. 2. I don't understand how you distinguish between illegal and legal UTF-8 sequences. I guess only legal sequences should be unescaped. Or to make it easy: if the string is valid UTF-8, do not escape. If it is not valid UTF-8, escape it. You could: Add unistr/u8-check to bootstrap.conf (./bootstrap thereafter), include #include <unistr.h> and use if (u8_check (s, strlen(s)) == 0) to test for validity. Yes, I expected you to say something like this. My reason: I consider this escaping a very doubtful activity. In my eyes the correct code is not: always escape except when UTF-8, but rather: never escape except perhaps when someone asks for it. So the precise check for UTF-8 is in my eyes just bloat. Of course, only when someone asks (in this special case). But the user should *really* know what he is doing, else the requested 'not-escaping' becomes an epic fail. Moreover: what to do if the name is not valid UTF-8? The current escaping produces something that is not valid UTF-8. So doing the current escaping is certainly a mistake, no better than using the name as-is. Invent a new type of escaping? The procedure should be (simplified): When extracting a URL from a document, we know its encoding. When we generate a filename from this URL we should (and can) convert to the local encoding first, then generate the filename. If this fails (likely an iconv() problem), we start escaping according to the user's wish (unless the user explicitly does not want escaping). So, for the time being, my previous patch avoided the old mistake, without introducing new mistakes :-). OK. 
Let's set up a test where we define input and expected output. If that works, I am fine. Regards, Tim
Re: [Bug-wget] bad filenames (again)
Hi Tim, Just a few questions. 1. Why don't you use 'opt.locale' to check if the local encoding is UTF-8? I thought that was usable only if ENABLE_IRI was defined. 2. I don't understand how you distinguish between illegal and legal UTF-8 sequences. I guess only legal sequences should be unescaped. Or to make it easy: if the string is valid UTF-8, do not escape. If it is not valid UTF-8, escape it. You could: Add unistr/u8-check to bootstrap.conf (./bootstrap thereafter), include #include <unistr.h> and use if (u8_check (s, strlen(s)) == 0) to test for validity. Yes, I expected you to say something like this. My reason: I consider this escaping a very doubtful activity. In my eyes the correct code is not: always escape except when UTF-8, but rather: never escape except perhaps when someone asks for it. So the precise check for UTF-8 is in my eyes just bloat. Moreover: what to do if the name is not valid UTF-8? The current escaping produces something that is not valid UTF-8. So doing the current escaping is certainly a mistake, no better than using the name as-is. Invent a new type of escaping? So, for the time being, my previous patch avoided the old mistake, without introducing new mistakes :-). Andries
Re: [Bug-wget] bad filenames (again)
On Wed, Aug 12, 2015 at 05:54:25PM +0200, Tim Ruehsen wrote: OK. Let's set up a test where we define input and expected output. If that works, I am fine. OK. I mentioned a Hebrew example, but in order to avoid the additional difficulty of bidi text, let me find a Russian example instead. % wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5 Saving to: ‘Се\321%80д\321%86е’ % my_wget https://ru.wikipedia.org/wiki/%D0%A1%D0%B5%D1%80%D0%B4%D1%86%D0%B5 Saving to: ‘Сердце’ (This is the Russian Wikipedia page for 'heart'). Andries --- BTW - now that I tried this: the progress bar contains an ugly symbol. Looking at progress.c I see int padding = MAX_FILENAME_COLS - orig_filename_cols; sprintf (p, "%s ", bp->f_download); p += orig_filename_cols + 1; for (;padding;padding--) *p++ = ' '; but orig_filename_cols was computed correctly, counting character positions, not bytes, so the p += orig_filename_cols + 1; is a bug. The ugly symbol appears because a multibyte character was truncated. If I write sprintf (p, "%s ", bp->f_download); p += strlen(bp->f_download) + 1; while (p < bp->buffer + MAX_FILENAME_COLS) *p++ = ' '; instead, then the progress bar text looks right in this particular case. I have not yet read the surrounding code.
Re: [Bug-wget] bad filenames (again)
On Fri, Aug 07, 2015 at 05:13:19PM +0200, Tim Ruehsen wrote: The solution would be something like: if the locale is UTF-8, do not escape valid UTF-8 sequences; else keep wget's current behavior. If you provide a patch for this we will appreciate that. OK - a first version of such a patch. This splits restrict_control into two halves. The low control is as before. The high control is permitted by default on a Unix system with something that looks like a UTF-8 locale. For Windows the behavior is unchanged. Andries Test: fetch http://he.wikipedia.org/wiki/הרפש_.ש diff -ru wget-1.16.3/src/init.c wget-1.16.3a/src/init.c --- wget-1.16.3/src/init.c 2015-01-31 00:25:57.0 +0100 +++ wget-1.16.3a/src/init.c 2015-08-09 21:44:54.260215105 +0200 @@ -333,6 +333,27 @@ return -1; } + +/* Used to determine whether bytes 128-159 are OK in a filename */ +static int +have_utf8_locale() { +#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__) + /* insert some test for Windows */ +#else + char *p; + + p = getenv("LC_ALL"); + if (p == NULL) +p = getenv("LC_CTYPE"); + if (p == NULL) +p = getenv("LANG"); + if (strstr(p, "UTF-8") != NULL || strstr(p, "UTF8") != NULL || + strstr(p, "utf-8") != NULL || strstr(p, "utf8") != NULL) +return true; +#endif + return false; +} + /* Reset the variables to default values. */ void defaults (void) @@ -401,6 +422,7 @@ opt.restrict_files_os = restrict_unix; #endif opt.restrict_files_ctrl = true; + opt.restrict_files_highctrl = (have_utf8_locale() ? 
false : true); opt.restrict_files_nonascii = false; opt.restrict_files_case = restrict_no_case_restriction; @@ -1466,6 +1488,7 @@ { int restrict_os = opt.restrict_files_os; int restrict_ctrl = opt.restrict_files_ctrl; + int restrict_highctrl = opt.restrict_files_highctrl; int restrict_case = opt.restrict_files_case; int restrict_nonascii = opt.restrict_files_nonascii; @@ -1488,7 +1511,7 @@ else if (VAL_IS (uppercase)) restrict_case = restrict_uppercase; else if (VAL_IS (nocontrol)) -restrict_ctrl = false; +restrict_ctrl = restrict_highctrl = false; else if (VAL_IS (ascii)) restrict_nonascii = true; else @@ -1509,6 +1532,7 @@ opt.restrict_files_os = restrict_os; opt.restrict_files_ctrl = restrict_ctrl; + opt.restrict_files_highctrl = restrict_highctrl; opt.restrict_files_case = restrict_case; opt.restrict_files_nonascii = restrict_nonascii; diff -ru wget-1.16.3/src/options.h wget-1.16.3a/src/options.h --- wget-1.16.3/src/options.h 2015-01-31 00:25:57.0 +0100 +++ wget-1.16.3a/src/options.h 2015-08-09 21:22:35.984186065 +0200 @@ -244,6 +244,7 @@ bool restrict_files_ctrl; /* non-zero if control chars in URLs are restricted from appearing in generated file names. */ + bool restrict_files_highctrl; /* idem for bytes 128-159 */ bool restrict_files_nonascii; /* non-zero if bytes with values greater than 127 are restricted. */ enum { diff -ru wget-1.16.3/src/url.c wget-1.16.3a/src/url.c --- wget-1.16.3/src/url.c 2015-02-23 16:10:22.0 +0100 +++ wget-1.16.3a/src/url.c 2015-08-09 21:14:34.876175626 +0200 @@ -1329,7 +1329,8 @@ enum { filechr_not_unix= 1, /* unusable on Unix, / and \0 */ filechr_not_windows = 2, /* unusable on Windows, one of \|/?:* */ - filechr_control = 4 /* a control character, e.g. 0-31 */ + filechr_control = 4, /* a control character, e.g. 
0-31 */ + filechr_highcontrol = 8 /* a high control character, in 128-159 */ }; #define FILE_CHAR_TEST(c, mask) \ @@ -1340,6 +1341,7 @@ #define U filechr_not_unix #define W filechr_not_windows #define C filechr_control +#define Z filechr_highcontrol #define UW U|W #define UWC U|W|C @@ -1370,8 +1372,8 @@ 0, 0, 0, 0, 0, 0, 0, 0, /* p q r s t u v w */ 0, 0, 0, 0, W, 0, 0, C, /* x y z { | } ~ DEL */ - C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, /* 128-143 */ - C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, /* 144-159 */ + Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, /* 128-143 */ + Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, /* 144-159 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, @@ -1383,6 +1385,7 @@ #undef U #undef W #undef C +#undef Z #undef UW #undef UWC @@ -1417,8 +1420,11 @@ mask = filechr_not_unix; else mask = filechr_not_windows; + if (opt.restrict_files_ctrl) mask |= filechr_control; + if (opt.restrict_files_highctrl) +mask |= filechr_highcontrol; /* Copy [b, e) to PATHEL and URL-unescape it. */ if (escaped)
Re: [Bug-wget] bad filenames (again)
Hi Andries, as I already mentioned, changing the default behavior of wget is not a good idea. But I started a wget2 branch that produces wget and wget2 executables. wget2's default behavior is to keep filenames as they are. I am not sure how it compiles and works on Windows (Cygwin could work). If you dare to check it out: any feedback is highly welcome. Regards, Tim On Thursday 06 August 2015 23:40:45 Andries E. Brouwer wrote: Today I again downloaded a large tree with wget and got only unusable filenames. Fortunately I have the utility wgetfix that repairs the consequences of this bug (see http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this wget bug should be fixed. (Maybe it has been fixed already? I looked at this in detail last year, and there was some correspondence, but I think nothing happened. Have not looked at the latest sources.) What happens is that wget under certain circumstances escapes certain bytes in a filename. I think that this was always a mistake, but it did not occur very much and was defensible: filenames with embedded control characters are a pain. Today the situation is just the opposite: when copying from a remote utf8 system to a local utf8 system, correct and normal filenames are escaped to create illegal filenames that cannot be used and are worse than a pain; one cannot do much else than discard them. What can the user do? If she is on Windows, she is told to switch to Linux: I can't help Windows users, but Wget is a power-user tool. And a Windows power-user should be able to start a virtual machine with Linux running to use tools like Wget. If she is on Linux, the easiest is to discard all that was downloaded and start over again, this time with the option --restrict-file-names=nocontrol If the user knows about wgetfix, that is an alternative. One can also use curl instead of wget. 
See also http://savannah.gnu.org/bugs/?37564 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors http://stackoverflow.com/questions/27054765/wget-japanese-characters http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget http://www.win.tue.nl/~aeb/linux/misc/wget.html Below I suggested an easy fix, and discussed some details. Andries On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote: On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote: On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote: If I ask wget to download the wikipedia page http://he.wikipedia.org/wiki/ש._שפרה then I hope for a resulting file ש._שפרה. Instead, wget gives me ש._שפר\327%94, where the \327 is an unpronounceable byte that cannot be typed (This is a UTF-8 system and the filename that wget produces is not valid UTF-8.) Maybe it would be better if wget by default used the original filename. This name mangling is a vestige of old times, it seems to me. This is a commonly reported grievance and, as you correctly mention, a vestige of old times. With UTF-8-capable filesystems, Wget should simply write the correct characters. I sincerely hope this issue is resolved as fast as possible, but I know not how to. Those who understand i18n should work on this. It is very easy to resolve the issue, but I don't know how backwards compatible the wget developers want to be. The easiest solution is to change the line (in init.c:defaults()) opt.restrict_files_ctrl = true; into opt.restrict_files_ctrl = false; That is what I would like to see: the default should be to preserve the name as-is, and there should be options escape_control or so to force the current default behaviour. There are also more complicated solutions. One can ask for LC_CTYPE or LANG or some such thing, and try to find out whether the current system is UTF-8, and only in that case set restrict_files_ctrl to false. 
I don't know anything about the Windows environment. Andries [Discussion: There is a flag --restrict-file-names. The manual page says By default, Wget escapes the characters that are not valid or safe as part of file names on your operating system, as well as control characters that are typically unprintable. Presently this is false: On a UTF-8 system Wget by default introduces illegal characters. The option nocontrol is needed to preserve the correct name. The flag is handled in init.c:cmd_spec_restrict_file_names() where opt.restrict_files_{os,case,ctrl,nonascii} are set. Of interest is the restrict_files_ctrl flag. Today init.c does by default: #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__) opt.restrict_files_os = restrict_windows; #else opt.restrict_files_os = restrict_unix; #endif opt.restrict_files_ctrl = true; opt.restrict_files_nonascii = false;
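[Editor's aside: the --restrict-file-names=nocontrol workaround mentioned above can also be made permanent. Assuming the wgetrc command name matches the documented option (the wget manual lists it as restrict_file_names), a minimal config sketch:]

```
# ~/.wgetrc -- make the workaround the default for every run
# (same effect as passing --restrict-file-names=nocontrol each time:
#  control bytes in URLs are no longer %-escaped in generated filenames)
restrict_file_names = nocontrol
```

This is a per-user setting; a command-line --restrict-file-names still overrides it for a single invocation.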
Re: [Bug-wget] bad filenames (again)
On Friday 07 August 2015 16:38:01 Andries E. Brouwer wrote: On Fri, Aug 07, 2015 at 04:14:45PM +0200, Tim Ruehsen wrote: Hi Andries, as I already mentioned, changing the default behavior of wget is not a good idea. But I started a wget2 branch that produces wget and wget2 executables. wget2's default behavior is to keep filenames as they are. I am not sure how it compiles and works on Windows (Cygwin could work). If you dare to check it out: any feedback is highly welcome. Regards, Tim Hi Tim, I disagree. This is just a bug. Nobody wants illegal filenames. Even removing them is not entirely trivial, since the filenames produced by wget are not legal character sequences, so cannot be typed. Hi Andries, obviously I got it wrong. If it's a bug, let's just fix it (without breaking compatibility). I don't have the time to read *all* the old emails right now. But as far as I understand, escaping occurs within legal UTF-8 sequences - and you are right when saying this is a bug when we have a UTF-8 locale. The solution would be something like: if the locale is UTF-8, do not escape valid UTF-8 sequences; else keep wget's current behavior. If URLs (and thus filenames) are not in UTF-8, Wget will convert them to UTF-8 before the above procedure (I guess that is what wget does anyway, well not 100% sure). Would you agree? If you provide a patch for this we will appreciate that. I am a Linux man, no Windows computers here. So, I am happy to do stuff on Linux, but cannot test on Windows. Sorry, won't bother you again regarding Windows ;-) Tim
Re: [Bug-wget] bad filenames (again)
On Fri, Aug 07, 2015 at 04:14:45PM +0200, Tim Ruehsen wrote: Hi Andries, as I already mentioned, changing the default behavior of wget is not a good idea. But I started a wget2 branch that produces wget and wget2 executables. wget2's default behavior is to keep filenames as they are. I am not sure how it compiles and works on Windows (Cygwin could work). If you dare to check it out: any feedback is highly welcome. Regards, Tim Hi Tim, I disagree. This is just a bug. Nobody wants illegal filenames. Even removing them is not entirely trivial since the filenames produced by wget are not legal character sequences, so cannot be typed. So, I think this should be fixed, for example with my one-liner fix, but I am quite happy to do something more complicated if that is what people prefer. I am a Linux man, no Windows computers here. So, I am happy to do stuff on Linux, but cannot test on Windows. Andries
Re: [Bug-wget] bad filenames (again)
Today I again downloaded a large tree with wget and got only unusable filenames. Fortunately I have the utility wgetfix that repairs the consequences of this bug (see http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this wget bug should be fixed. (Maybe it has been fixed already? I looked at this in detail last year, and there was some correspondence, but I think nothing happened. Have not looked at the latest sources.) What happens is that wget under certain circumstances escapes certain bytes in a filename. I think that this was always a mistake, but it did not occur very much and was defensible: filenames with embedded control characters are a pain. Today the situation is just the opposite: when copying from a remote utf8 system to a local utf8 system, correct and normal filenames are escaped to create illegal filenames that cannot be used and are worse than a pain; one cannot do much else than discard them. What can the user do? If she is on Windows, she is told to switch to Linux: I can't help Windows users, but Wget is a power-user tool. And a Windows power-user should be able to start a virtual machine with Linux running to use tools like Wget. If she is on Linux, the easiest is to discard all that was downloaded and start over again, this time with the option --restrict-file-names=nocontrol If the user knows about wgetfix, that is an alternative. One can also use curl instead of wget. See also http://savannah.gnu.org/bugs/?37564 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors http://stackoverflow.com/questions/27054765/wget-japanese-characters http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget http://www.win.tue.nl/~aeb/linux/misc/wget.html Below I suggested an easy fix, and discussed some details. Andries On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote: On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote: On Tue, Apr 22, 2014 at 10:57 PM, Andries E. 
Brouwer wrote: If I ask wget to download the wikipedia page http://he.wikipedia.org/wiki/ש._שפרה then I hope for a resulting file ש._שפרה. Instead, wget gives me ש._שפר\327%94, where the \327 is an unpronounceable byte that cannot be typed (This is an UTF-8 system and the filename that wget produces is not valid UTF-8.) Maybe it would be better if wget by default used the original filename. This name mangling is a vestige of old times, it seems to me. This is a commonly reported grievance and as you correctly mention a vestige of old times. With UTF-8 supported filesystems, Wget should simply write the correct characters. I sincerely hope this issue is resolved as fast as possible, but I know not how to. Those who understand i18n should work on this. It is very easy to resolve the issue, but I don't know how backwards compatible the wget developers want to be. The easiest solution is to change the line (in init.c:defaults()) opt.restrict_files_ctrl = true; into opt.restrict_files_ctrl = false; That is what I would like to see: the default should be to preserve the name as-is, and there should be options escape_control or so to force the current default behaviour. There are also more complicated solutions. One can ask for LC_CTYPE or LANG or some such thing, and try to find out whether the current system is UTF-8, and only in that case set restrict_files_ctrl to false. I don't know anything about the Windows environment. Andries [Discussion: There is a flag --restrict-file-names. The manual page says By default, Wget escapes the characters that are not valid or safe as part of file names on your operating system, as well as control characters that are typically unprintable. Presently this is false: On a UTF-8 system Wget by default introduces illegal characters. The option nocontrol is needed to preserve the correct name. The flag is handled in init.c:cmd_spec_restrict_file_names() where opt.restrict_files_{os,case,ctrl,nonascii} are set. 
Of interest is the restrict_files_ctrl flag. Today init.c does by default: #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__) opt.restrict_files_os = restrict_windows; #else opt.restrict_files_os = restrict_unix; #endif opt.restrict_files_ctrl = true; opt.restrict_files_nonascii = false; opt.restrict_files_case = restrict_no_case_restriction; The value of these flags is used in url.c:append_uri_pathel where FILE_CHAR_TEST (*p, mask) is used to decide what bytes in the filename need quoting. This is too simplistic an approach: quoting is introduced in the middle of multibyte characters. So the current setup is buggy and wrong. Basically the choice is between making the unfortunately named nocontrol (it should be called preserve_name or so) the default and adding more heuristics to detect and solve the worst problems. For example, UTF-8 is easy to