Re: How to detect the encoding of a string?

Bruno Haible Fri, 03 Jun 2005 06:11:12 -0700

Simos Xenitellis wrote:
> Is there a library or sample program that can do such a "encoding
> detection" based on short strings of unknown encoding
> (or to choose from encodings based on a smaller list than "iconv --list")?


Unfortunately, the encoding of the filenames is not specified in the
central_directory_file_header in unzip.h. So the best you can do is to
fall back on heuristics, based on these three pieces of information:

 1) the version_made_by[1] field, which contains the OS on which the zip
    file was made,
 2) the locale (especially language) of the user who attempts to extract the
    zip,
 3) the set of filenames in the zip file.

Here's how you can use this information to do something meaningful:

1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST
   encoding, FS_NTFS and FS_VFAT preferably use Windows encodings, BEOS
   uses UTF-8, MAC uses the MAC-* specific encodings, and MAC_OSX uses UTF-8
   in decomposed normal form.
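
A minimal Python sketch of this first step, covering only the hosts whose
encoding is unambiguous. The host names are the ones used above; the table
name HOST_ENCODING_HINTS and the function decode_with_host_hint are made up
for illustration, and mapping the numeric version_made_by[1] value to a host
name is assumed to be done elsewhere from the constants in unzip.h. ATARI,
FS_NTFS, FS_VFAT and MAC are left out because their encoding also depends on
the language (and Python has no ATARIST codec).

```python
import unicodedata

# Hosts for which the post names a single encoding.  Other hosts are
# intentionally missing: they need the language-based list as well.
HOST_ENCODING_HINTS = {
    "AMIGA": ["iso-8859-1"],
    "BEOS": ["utf-8"],
    "MAC_OSX": ["utf-8"],
}

def decode_with_host_hint(raw, host):
    """Decode raw filename bytes using the host hint, or return None."""
    for encoding in HOST_ENCODING_HINTS.get(host, []):
        try:
            name = raw.decode(encoding)  # strict: invalid bytes raise
        except UnicodeDecodeError:
            continue
        if host == "MAC_OSX":
            # Names arrive in decomposed form; recompose them (NFC) so
            # they compare equal to names typed on other systems.
            name = unicodedata.normalize("NFC", name)
        return name
    return None
```

A None result means the host alone does not settle the question and the
other two heuristics are needed.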

2) Assuming that the language of the person who extracts the zip often matches
   the language of the one who created it, you can set up a list of encodings
   to try:

   Afrikaans          UTF-8 ISO-8859-15 ISO-8859-1
   Albanian           UTF-8 ISO-8859-15 ISO-8859-1
   Arabic             UTF-8 ISO-8859-6 CP1256
   Armenian           UTF-8 ARMSCII-8
   Basque             UTF-8 ISO-8859-15 ISO-8859-1
   Breton             UTF-8 ISO-8859-15 ISO-8859-1
   Bulgarian          UTF-8 ISO-8859-5
   Byelorussian       UTF-8 ISO-8859-5
   Catalan            UTF-8 ISO-8859-15 ISO-8859-1
   Chinese            UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
   Cornish            UTF-8 ISO-8859-15 ISO-8859-1
   Croatian           UTF-8 ISO-8859-2
   Czech              UTF-8 ISO-8859-2
   Danish             UTF-8 ISO-8859-15 ISO-8859-1
   Dutch              UTF-8 ISO-8859-15 ISO-8859-1
   English            UTF-8 ISO-8859-15 ISO-8859-1
   Esperanto          UTF-8 ISO-8859-3
   Estonian           UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Faeroese           UTF-8 ISO-8859-15 ISO-8859-1
   Finnish            UTF-8 ISO-8859-15 ISO-8859-1
   French             UTF-8 ISO-8859-15 ISO-8859-1
   Frisian            UTF-8 ISO-8859-15 ISO-8859-1
   Galician           UTF-8 ISO-8859-15 ISO-8859-1
   Georgian           UTF-8 GEORGIAN-ACADEMY GEORGIAN-PS
   German             UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-2
   Greek              UTF-8 ISO-8859-7
   Greenlandic        UTF-8 ISO-8859-15 ISO-8859-1
   Hebrew             UTF-8 ISO-8859-8 CP1255
   Hungarian          UTF-8 ISO-8859-2
   Icelandic          UTF-8 ISO-8859-10 ISO-8859-15 ISO-8859-1
   Inuit              UTF-8 ISO-8859-10
   Irish              UTF-8 ISO-8859-14 ISO-8859-15 ISO-8859-1
   Italian            UTF-8 ISO-8859-15 ISO-8859-1
   Japanese           UTF-8 EUC-JP CP932
   Kazakh             UTF-8 PT154
   Korean             UTF-8 EUC-KR CP949 JOHAB
   Laotian            UTF-8 MULELAO-1 CP1133
   Latin              UTF-8 ISO-8859-15 ISO-8859-1
   Latvian            UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Lithuanian         UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Luxemburgish       UTF-8 ISO-8859-15 ISO-8859-1
   Macedonian         UTF-8 ISO-8859-5
   Maltese            UTF-8 ISO-8859-3
   Manx Gaelic        UTF-8 ISO-8859-14
   Norwegian          UTF-8 ISO-8859-15 ISO-8859-1
   Polish             UTF-8 ISO-8859-2 ISO-8859-13
   Portuguese         UTF-8 ISO-8859-15 ISO-8859-1
   Raeto-Romanic      UTF-8 ISO-8859-15 ISO-8859-1
   Romanian           UTF-8 ISO-8859-16
   Russian            UTF-8 KOI8-R ISO-8859-5 KOI8-RU
   Sami               UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Scottish           UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-14
   Serbian            UTF-8 ISO-8859-5
   Slovak             UTF-8 ISO-8859-2
   Slovenian          UTF-8 ISO-8859-2
   Sorbian            UTF-8 ISO-8859-2
   Spanish            UTF-8 ISO-8859-15 ISO-8859-1
   Swedish            UTF-8 ISO-8859-15 ISO-8859-1
   Tajik              UTF-8 KOI8-T
   Thai               UTF-8 ISO-8859-11 TIS-620 CP874
   Turkish            UTF-8 ISO-8859-9
   Ukrainian          UTF-8 KOI8-U ISO-8859-5
   Vietnamese         UTF-8 VISCII TCVN CP1258
   Welsh              UTF-8 ISO-8859-14
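
The table can be turned into a lookup keyed by the language part of the
locale name. The following Python sketch carries only four rows of the
table; the two-letter language codes, the name candidates_for, and the
UTF-8-only fallback for unlisted languages are assumptions made for the
example, not part of the list above.

```python
# Excerpt of the table above, keyed by ISO 639-1 language code.
CANDIDATES = {
    "de": ["UTF-8", "ISO-8859-15", "ISO-8859-1", "ISO-8859-2"],  # German
    "el": ["UTF-8", "ISO-8859-7"],                               # Greek
    "ja": ["UTF-8", "EUC-JP", "CP932"],                          # Japanese
    "uk": ["UTF-8", "KOI8-U", "ISO-8859-5"],                     # Ukrainian
}

def candidates_for(locale_name):
    """Return the encodings to try for a name like 'el_GR.UTF-8'."""
    # Keep only the language: strip territory, codeset and modifier.
    language = locale_name.replace("-", "_").split("_")[0]
    language = language.split(".")[0].split("@")[0].lower()
    # Assumed fallback: languages not in the excerpt get UTF-8 only.
    return CANDIDATES.get(language, ["UTF-8"])
```

The locale name would typically come from the user's environment, for
example the LANG or LC_ALL variable.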

3) Look at the set of file names in the zip. If they _all_ happen to be
   valid UTF-8, you can assume that's the encoding (because very few
   meaningful strings look like UTF-8 but aren't).
   Otherwise, test the other candidate encodings in the same way.
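
As a rough Python sketch of this third step: try each candidate in order and
accept the first one in which every file name decodes without error. The
function name guess_encoding is made up for the example, and it assumes the
raw, undecoded file name bytes are already at hand.

```python
def guess_encoding(raw_names, candidates):
    """Return the first candidate that decodes _all_ names, else None."""
    for encoding in candidates:
        try:
            for raw in raw_names:
                raw.decode(encoding)  # strict decoding, result discarded
        except (UnicodeDecodeError, LookupError):
            # Invalid byte sequence, or a codec Python does not know.
            continue
        return encoding
    return None
```

Note that this test is only discriminating for UTF-8 and the multi-byte
encodings. Most 8-bit encodings accept almost any byte sequence, so among
those the order of the candidate list effectively decides.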

   Furthermore, for Chinese, you can use character-frequency-based
   techniques such as
     http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
     http://kamares.ucsd.edu/~arobert/hanziData.html
     http://www.mandarintools.com/codeguess.html

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
