On Mon, Aug 24, 2015 at 03:44:09PM +0200, Tim Ruehsen wrote:

> Just implemented (or let's say fixed) Content-Disposition in wget2. It now
> saves the file as
> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Good!

> Content-Disposition (filename, filename*) is standardized, but browsers seem
> to behave/parse very differently, ignoring standards.

Yes. On the web a general phenomenon is that non-specialists create websites.
They know nothing about standards, but fiddle until it works (say, with IE6).
Also Microsoft does/did not respect standards.
A consequence is that practice is more important than theory.
One has to try for robust solutions.

> > I prefer to base the decision about what to do on the form
> > of the filename (ASCII / UTF-8 / other), not on the
> > headers encountered on the way to this file.
>
> I guess we can find an easy agreement.
>
> 1. Wget has to obey the defaults. If it fails or we find a well-known
> misbehavior (server/document fault), handle it automatically.
> That's how we try to do it now.
>
> 2. If still a problem arises, the user should be able to intercept. Using
> special command line options for fine-tuning Wget's behavior.

Yes, whatever the user says, we do; the case where options have been
given is unproblematic. That leaves your point 1.

I am not sure what you think the defaults are.
My basic example is the %-encoded pure ASCII URL, referring to
a non-text object. How should wget save the object?
There is zero charset information.

My answer today (after conversation with Eli) is:
"Decode the %-encoded string. The last part is the suggested filename.
If it is ASCII, use that ASCII name (where valid for the OS).
If it is UTF-8 (but not ASCII), use it when the locale is UTF-8,
otherwise convert (if possible) or escape.
If it is not UTF-8, escape."
[That is, I would recognize only what is easy to recognize,
and prefer not to rely on any headers. Prefer not to convert
except possibly in the UTF-8 case.]

How does your answer differ?

Some ancient docs say that ISO-8859-1 is a default.
Even if such docs have not yet been replaced, I feel that
they no longer reflect current practice. ISO-8859-x is dying.
All the web should converge to Unicode, whatever that may be.

The relevant example might be
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

I have the impression that you are happy with "knäckebröd.jpg"
but I would be unhappy with that (although it happens to be correct),
since guessing and conversion are involved. Guessing may not be so bad,
but guessing and converting is terrible: it can be really complicated
to retrieve the original filename after an incorrect conversion.

Andries


Another URL:
http://hongaarskinderplezier.eu/index.php?pagina=96&naam=Gy%25F5r-Moson-Sopron

This is about holidays near the beautiful city Győr in Hungary.
But what happened with the city? Its name was written in ISO-8859-2,
using 0xf5, and that was %-escaped to %f5, and that was again
%-escaped to %25f5.

I would undo the %-escape and see pure ASCII, and save as
index.php?pagina=96&naam=Gy%F5r-Moson-Sopron.
What would you do?

The page has <meta charset="ISO-8859-2" />
The headers have Content-Type: text/html without charset information.

---

Similarly
http://www.matklubben.se/recept/lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l
has the %-encoded version of "Lchf kn%e4ckebr%f6d mandelmj%f6l"
which in turn encodes the ISO-8859-1 version of lchf knäckebröd mandelmjöl.

Such double encodings are not uncommon. But as a first approximation
I think wget should not try to recognize them.

---

http://www.eet-china.com/SEARCH/ART/%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1%E7%9A%84%EF%BC%85C0.HTM
ends in %C0%B6的%D1的%C0.HTM - this is an %-encoding using
fat %-signs (U+ff05).

You see that one can encounter all levels of messiness.
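
For illustration, the default rule quoted in the first message (decode once,
then decide on the form of the name: ASCII / UTF-8 / other) could be sketched
roughly as below. This is Python rather than wget code, and several details
are assumptions: the names suggest_filename and escape_non_ascii are
hypothetical, "escape" is taken to mean writing each non-ASCII byte as %XX,
the "convert (if possible)" branch is simplified to escaping, and the check
for names that are invalid on the OS is left out.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def escape_non_ascii(raw: bytes) -> str:
    # Assumed escape form: each byte >= 0x80 becomes %XX, ASCII bytes are kept,
    # so the original byte sequence stays recoverable.
    return "".join(chr(b) if b < 0x80 else f"%{b:02X}" for b in raw)


def suggest_filename(url: str, locale_is_utf8: bool = True) -> str:
    # Hypothetical helper: no headers and no charset guessing are consulted.
    parts = urlsplit(url)
    last = parts.path.rsplit("/", 1)[-1]
    if parts.query:
        last += "?" + parts.query

    # Undo exactly one level of %-encoding; double encodings are not detected.
    raw = unquote_to_bytes(last)

    if raw.isascii():
        return raw.decode("ascii")

    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Neither ASCII nor UTF-8: escape, never convert.
        return escape_non_ascii(raw)

    # Valid UTF-8: use as-is only in a UTF-8 locale (simplified: else escape).
    return name if locale_is_utf8 else escape_non_ascii(raw)
```

With this sketch the kn%e4ckebr%f6d.jpg URL yields "kn%E4ckebr%F6d.jpg"
(the bytes are not valid UTF-8, so nothing is converted), and the Győr URL
yields "index.php?pagina=96&naam=Gy%F5r-Moson-Sopron", since one level of
decoding leaves pure ASCII.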
