On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

> OK, then let me explain my line of reasoning. Plain ASCII is valid
> UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
> know it's not valid UTF-8. So the last 3 possibilities in your
> suggestion boil down to "try converting as if it were UTF-8, and if
> that fails, you know it's Unknown".

Yes, although I would not invoke iconv to actually convert from UTF-8 to UTF-8. Unicode is a complicated beast, and it is not certain that conversion from UTF-8 to UTF-8 is the identity transformation. (For example, implementations may prefer either NFC or NFD. MacOS has its own NFD-like version for filenames.) But you are right, one can use it as a test.

After finding out that the charset is unknown, I want to hex-encode the entire filename. On the other hand, if the appropriate thing is to invoke iconv to convert from one charset to another, I want to hex-encode only the failing bytes.

The reason for this difference is:

(a) If there is reason to expect that conversion should be possible, for example because the user specified the from-charset as GB18030, and it fails, then it often fails only in a few isolated places where Microsoft extensions are used, and it is more user-friendly to do the conversion where possible.

(b) If nothing is known, then the character set can be a multibyte one like SJIS, where ASCII bytes occur as second halves of symbols, and not escaping such ASCII bytes is confusing and sometimes leads to strange problems.

Andries
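
To make the two policies above concrete, here is a rough sketch in Python rather than C with iconv, purely for brevity. Everything named in it is an assumption for illustration: the `%XX` escape notation, the function names `escape_unknown` and `convert_known`, and the error-handler name `hexescape`. The strict UTF-8 decode is used only as a validity test; no normalization is applied.

```python
import codecs


def _hex(data: bytes) -> str:
    # Hypothetical escape notation: one "%XX" group per byte.
    return "".join(f"%{b:02X}" for b in data)


def escape_unknown(name: bytes) -> str:
    """Charset unknown: accept valid UTF-8 (which includes plain ASCII),
    otherwise hex-encode the ENTIRE filename, ASCII bytes included."""
    try:
        return name.decode("utf-8")  # strict decode, used purely as a test
    except UnicodeDecodeError:
        return _hex(name)


def _hex_failing_bytes(err):
    # Error handler: replace only the bytes the decoder rejected,
    # then resume decoding right after them.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return _hex(err.object[err.start:err.end]), err.end


# "hexescape" is a made-up handler name registered for this sketch.
codecs.register_error("hexescape", _hex_failing_bytes)


def convert_known(name: bytes, from_charset: str) -> str:
    """Charset given by the user: convert what can be converted and
    hex-encode only the failing bytes."""
    return name.decode(from_charset, errors="hexescape")
```

The SJIS case in (b) is the reason `escape_unknown` escapes everything: the katakana "so" is the byte pair 0x83 0x5C in Shift_JIS, and 0x5C on its own is an ASCII backslash, so leaving it unescaped would put a stray backslash into the name.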
