Re: [Bug-wget] bad filenames (again)

Andries E. Brouwer Tue, 25 Aug 2015 06:00:19 -0700

On Mon, Aug 24, 2015 at 03:44:09PM +0200, Tim Ruehsen wrote:

> Just implemented (or let's say fixed) Content-Disposition in wget2. It now
> saves the file as
> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf


Good!

> Content-Disposition (filename, filename*) is standardized, but browsers seem
> to behave/parse very differently, ignoring standards.

Yes. On the web a general phenomenon is that non-specialists create websites.
They know nothing about standards, but fiddle until it works (say, with IE6).
Also Microsoft does/did not respect standards.

A consequence is that practice is more important than theory.
One has to try for robust solutions.

> > I prefer to base the decision about what to do on the form
> > of the filename (ASCII / UTF-8 / other), not on the
> > headers encountered on the way to this file.
> 
> I guess we can find an easy agreement.
> 
> 1. Wget has to obey the defaults. If it fails or we find a well-known 
> misbehavior (server/document fault), handle it automatically.
> That's how we try to do it now.
> 
> 2. If still a problem arises, the user should be able to intercept. Using 
> special command line options for fine-tuning Wget's behavior.

Yes, whatever the user says, we do; the case where options have been given
is unproblematic.

That leaves your point 1. I am not sure what you think the defaults are.

My basic example is the %-encoded pure ASCII url, referring to a non-text
object. How should wget save the object? There is zero charset information.
My answer today (after conversation with Eli) is:
"Decode the %-encoded string. The last part is the suggested filename.
If it is ASCII, use that ASCII name (where valid for the OS).
If it is UTF-8 (but not ASCII), use it when the locale is UTF-8,
otherwise convert (if possible) or escape.  If it is not UTF-8, escape."

[That is, I would recognize only what is easy to recognize,
and prefer not to rely on any headers. Prefer not to convert
except possibly in the UTF-8 case.]
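
To make the rule above concrete, here is a minimal sketch in Python. It is only an illustration, not wget code: the function names, the `locale_is_utf8` flag, the `example.org` URLs and the `%XX` escaping style are all my assumptions. It skips the "convert (if possible)" branch and always escapes instead, it does not check whether the name is valid for the OS, and it takes the last path component before decoding so that a decoded `%2F` cannot introduce a new path separator.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def classify(name: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other', looking only at the bytes."""
    try:
        name.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        name.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def escape(name: bytes) -> str:
    """Write every non-ASCII byte as %XX (hypothetical escaping style)."""
    return "".join(chr(b) if b < 0x80 else "%%%02X" % b for b in name)


def suggested_filename(url: str, locale_is_utf8: bool) -> str:
    # Take the last path component, then undo one level of %-encoding.
    last = urlsplit(url).path.rsplit("/", 1)[-1]
    raw = unquote_to_bytes(last)
    kind = classify(raw)
    if kind == "ascii":
        return raw.decode("ascii")
    if kind == "utf-8" and locale_is_utf8:
        return raw.decode("utf-8")
    # UTF-8 in a non-UTF-8 locale, or unrecognized bytes: escape, never guess.
    return escape(raw)


print(suggested_filename(
    "http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg", True))
# prints kn%E4ckebr%F6d.jpg
print(suggested_filename("http://example.org/kn%C3%A4ckebr%C3%B6d.jpg", True))
# prints knäckebröd.jpg
```

No header and no charset guess enters the decision; the only outside input is whether the locale is UTF-8.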

How does your answer differ?
Some ancient docs say that ISO-8859-1 is a default. Even if such docs
have not yet been replaced, I feel that they no longer reflect current
practice. ISO-8859-x is dying. All the web should converge to Unicode,
whatever that may be.

A relevant example might be
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
I have the impression that you are happy with "kn=C3=A4ckebr=C3=B6d.jpg"
but I would be unhappy with that (although it happens to be correct),
since guessing and conversion are involved.
Guessing may not be so bad, but guessing and converting is terrible:
it can be really complicated to retrieve the original filename
after an incorrect conversion.

Andries


Another URL:
http://hongaarskinderplezier.eu/index.php?pagina=96&naam=Gy%25F5r-Moson-Sopron
This is about holidays near the beautiful city Győr in Hungary.
But what happened with the city? Its name was written in ISO-8859-2,
using 0xf5, and that was %-escaped to %f5, and that was again
%-escaped to %25f5.

I would undo the %-escape and see pure ASCII, and save as
index.php?pagina=96&naam=Gy%F5r-Moson-Sopron.
What would you do?
The page has <meta charset="ISO-8859-2" />
The headers have Content-Type: text/html without charset information.
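
The two levels of escaping can be seen in a small Python sketch. It only illustrates the example above; the variable names are mine, and the second decoding step plus the ISO-8859-2 interpretation are shown for explanation, not as something wget should do.

```python
from urllib.parse import unquote_to_bytes

encoded = "Gy%25F5r-Moson-Sopron"      # the "naam" value as it appears in the URL

once = unquote_to_bytes(encoded)       # undo one level of %-escaping
print(once)                            # prints b'Gy%F5r-Moson-Sopron'
print(all(b < 0x80 for b in once))     # prints True: still pure ASCII

twice = unquote_to_bytes(once)         # undo the second level (illustration only)
print(twice)                           # prints b'Gy\xf5r-Moson-Sopron'
print(twice.decode("iso-8859-2"))      # prints Győr-Moson-Sopron
```

After one decoding step the name is pure ASCII, so nothing about the charset has to be guessed in order to save it.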

---

Similarly http://www.matklubben.se/recept/lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l
contains the %-encoded version of "Lchf kn%e4ckebr%f6d mandelmj%f6l",
which is itself the %-encoded ISO-8859-1 version of lchf knäckebröd mandelmjöl.

Such double encodings are not uncommon.
But as a first approximation I think wget should not try to recognize them.

---

http://www.eet-china.com/SEARCH/ART/%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1%E7%9A%84%EF%BC%85C0.HTM
ends in ％C0％B6的％D1的％C0.HTM - this is a %-encoding using fat %-signs (U+ff05).
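
A short Python check of this last component, as a sketch only (the variable names are mine): the real %-escapes decode to valid UTF-8, and the result contains U+FF05, which an ordinary %-decoder does not treat as an escape character.

```python
from urllib.parse import unquote

last = ("%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1"
        "%E7%9A%84%EF%BC%85C0.HTM")

# Undo the real %-escapes; the resulting bytes are valid UTF-8.
name = unquote(last, encoding="utf-8", errors="strict")
print(name)                       # prints ％C0％B6的％D1的％C0.HTM

print("\uff05" in name)           # prints True: FULLWIDTH PERCENT SIGN
print(unquote(name) == name)      # prints True: a second pass changes nothing
```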

You see that one can encounter all levels of messiness.
