Re: How to detect the encoding of a string?

Danilo Segan Thu, 02 Jun 2005 13:03:38 -0700

Hi Simos,

It's completely impossible to detect which of the 8-bit encodings is
used without any further knowledge (for instance, of the language in
use).

To actually choose among the many 8-bit encodings suitable for a
language, one would also need to know properties of that language
(such as the frequency of each letter in it), but even then it is
unlikely to work for strings as short as filenames.
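
To make the idea concrete, here is a rough Python sketch of such a
heuristic. It is much cruder than real letter-frequency analysis: it
only counts how many decoded characters land in the Greek Unicode
block. The function names and the candidate list are made up for
illustration, not taken from any existing tool.

```python
def greek_ratio(text: str) -> float:
    """Fraction of characters in the Greek and Coptic block (U+0370-U+03FF)."""
    if not text:
        return 0.0
    greek = sum(1 for ch in text if "\u0370" <= ch <= "\u03ff")
    return greek / len(text)


def rank_encodings(raw: bytes, candidates):
    """Decode raw under each candidate encoding and sort by Greek ratio.

    Hypothetical helper: undecodable bytes are replaced rather than
    treated as proof that an encoding is wrong.
    """
    scored = []
    for enc in candidates:
        text = raw.decode(enc, errors="replace")
        scored.append((greek_ratio(text), enc))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    raw = "λογος".encode("cp737")
    for score, enc in rank_encodings(raw, ["cp737", "iso8859_1"]):
        print(f"{enc}: {score:.2f}")
```

This separates a Greek code page from Latin-1 easily enough, but two
Greek encodings can both score well on a short filename, which is
exactly the problem described above.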

If you need a formal proof of "undetectability", here's one:
- a valid ISO-8859-1 string is always a completely valid ISO-8859-2
(or -4, -5) string (they occupy exactly the same spots 0xa1-0xff),
i.e. you can *never* determine whether some character not present in
another set is actually used.
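
The argument can be checked with a few lines of Python. This is only
a sketch: it relies on Python's bundled codecs (iso8859_1, iso8859_2,
iso8859_4, iso8859_5), and the helper name is made up for
illustration.

```python
# Codec names as shipped with Python's standard library.
CANDIDATES = ["iso8859_1", "iso8859_2", "iso8859_4", "iso8859_5"]


def decodes_everywhere(raw: bytes) -> bool:
    """Return True if raw is a valid string in every candidate encoding."""
    for enc in CANDIDATES:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            return False
    return True


if __name__ == "__main__":
    # Every possible byte value is accepted by all four encodings,
    # so no byte can ever rule one of them out.
    print(decodes_everywhere(bytes(range(256))))

    # The same byte simply means a different character in each set.
    for enc in CANDIDATES:
        print(enc, ascii(b"\xe8".decode(enc)))
```

Since decoding never fails, validity alone carries no information
about which of these encodings was intended.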

Today at 20:16, Simos Xenitellis wrote:

> P.S.
> If you would like to experiment with your own ZIP application,
> try
> http://www.thranio.gr/sxolikes-giortes/telikes/omilies/apoxairetisthrio-logos-mathith.zip
> The filename is encoded in CP737 (a la iconv). All open-source ZIP
> tools (=unzip, file-roller, ark) fail to detect the encoding.
> WinZip is able to detect the encoding.

My guess is that WinZip is running on a Greek Windows, and that
WinZip uses the old IBM encodings for i18n names there, assuming
CP737 on a Greek system.
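
If that guess is right, the "detection" is really just a lookup based
on the system language. A minimal Python sketch of that behaviour
follows; the table and function names are hypothetical, and this is
not a description of how WinZip is actually implemented.

```python
# Hypothetical mapping from system language to the legacy IBM/DOS
# code page assumed for filenames. Illustrative, not exhaustive.
LEGACY_CODEPAGE_BY_LANGUAGE = {
    "el": "cp737",  # Greek
    "ru": "cp866",  # Russian
    "en": "cp437",  # US English
}


def decode_legacy_name(raw: bytes, language: str) -> str:
    """Decode raw filename bytes using the code page assumed for language.

    Unknown languages fall back to cp437 (an assumption of this sketch).
    """
    codepage = LEGACY_CODEPAGE_BY_LANGUAGE.get(language, "cp437")
    return raw.decode(codepage)


if __name__ == "__main__":
    raw = "λογος.zip".encode("cp737")
    print(decode_legacy_name(raw, "el"))  # the Greek name comes back intact
    print(decode_legacy_name(raw, "en"))  # same bytes, wrong assumption
```

Nothing is detected here: the same bytes decode without error on an
English system too, just to the wrong characters.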

Can you confirm or dispute my assumption (e.g. by trying on a
non-Greek Windows system, or just confirming that this was actually
attempted on a non-Greek system)?

Cheers,
Danilo

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
