Re: [Bug-wget] bad filenames (again)

Andries E. Brouwer Thu, 20 Aug 2015 15:12:07 -0700

On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

> OK, then let me explain my line of reasoning.  Plain ASCII is valid
> UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
> know it's not valid UTF-8.  So the last 3 possibilities in your
> suggestion boil down to "try converting as if it were UTF-8, and if
> that fails, you know it's Unknown".


Yes, although I would not invoke iconv to actually convert from UTF-8 to
UTF-8. Unicode is a complicated beast, and it is not certain that
conversion from UTF-8 to UTF-8 is the identity transformation.
(For example, implementations may prefer either NFC or NFD.
MacOS has its own NFD-like version for filenames.)
But you are right, one can use it as a test.
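
To illustrate the two points above, here is a minimal sketch in Python
(not wget code; the helper name is_valid_utf8 is made up for this
example). It tests validity by strict decoding without keeping the
result, and shows that NFC and NFD give two different, equally valid
UTF-8 byte strings for the same visible name.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name in two normalization forms.
nfc_name = unicodedata.normalize("NFC", "caf\u00e9").encode("utf-8")
nfd_name = unicodedata.normalize("NFD", "caf\u00e9").encode("utf-8")

# Both are valid UTF-8, yet the byte strings differ, so a
# "UTF-8 to UTF-8" conversion that normalizes is not the identity.
print(is_valid_utf8(nfc_name), is_valid_utf8(nfd_name))
print(nfc_name == nfd_name)

# A lone Latin-1 byte is not valid UTF-8; plain ASCII is.
print(is_valid_utf8(b"caf\xe9"), is_valid_utf8(b"plain.txt"))
```

Only the yes/no answer is used here; the filename bytes themselves are
left untouched.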

After finding out that the charset is unknown I want to hex-encode
the entire filename. On the other hand, if the appropriate thing
is to invoke iconv to convert from one charset to another, I want
to hex-encode only the failing bytes.
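
A rough sketch of the two strategies, in Python rather than wget's C.
The %XX escape format, the function names and the "hexescape" handler
name are assumptions made for this illustration; nothing here is an
existing wget interface.

```python
import codecs


def hex_escape_all(raw: bytes) -> str:
    """Unknown charset: escape every byte, ASCII included."""
    return "".join(f"%{b:02X}" for b in raw)


def _hex_escape_handler(err):
    """Decode-error handler: replace only the failing bytes by %XX."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"%{b:02X}" for b in bad), err.end


# "hexescape" is a name chosen for this sketch.
codecs.register_error("hexescape", _hex_escape_handler)


def convert_escaping_failures(raw: bytes, from_charset: str) -> str:
    """Known charset: convert what converts, escape what does not."""
    return raw.decode(from_charset, errors="hexescape")


# Known from-charset with one stray byte: only that byte is escaped.
print(convert_escaping_failures(b"ab\xffcd", "utf-8"))

# Unknown charset: nothing is trusted, so everything is escaped.
print(hex_escape_all(b"a\xff"))
```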

The reason for this difference is as follows. (a) If there is reason
to expect that conversion should be possible, for example because the
user specified the from-charset as GB18030, and it fails, then it
often fails only in a few isolated places where Microsoft extensions
are used, and it is more user-friendly to do the conversion where
possible. (b) If nothing is known, then the character set can be a
multibyte one like SJIS, where ASCII bytes occur as second halves
of symbols, and not escaping such ASCII bytes is confusing
and sometimes leads to strange problems.
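
A small illustration of case (b), as a Python sketch with made-up
function names and an assumed %XX escape format. In Shift_JIS the
katakana character U+30BD is encoded as the bytes 0x83 0x5C, and 0x5C
on its own is the ASCII backslash.

```python
def escape_non_ascii_only(raw: bytes) -> str:
    """Escape bytes >= 0x80 and pass ASCII bytes through unchanged."""
    return "".join(chr(b) if b < 0x80 else f"%{b:02X}" for b in raw)


def escape_every_byte(raw: bytes) -> str:
    """Escape all bytes, so no half of a multibyte symbol leaks out."""
    return "".join(f"%{b:02X}" for b in raw)


# U+30BD followed by ".txt", encoded in Shift_JIS: 0x83 0x5C ".txt".
sjis_name = "\u30bd.txt".encode("shift_jis")

# Passing ASCII through leaves a stray backslash in the filename.
print(escape_non_ascii_only(sjis_name))

# Escaping every byte avoids it.
print(escape_every_byte(sjis_name))
```

The first result contains a bare backslash that was never a character
of the original name, which is the kind of strange problem meant above.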

Andries
