Simos Xenitellis wrote:
> Is there a library or sample program that can do such a "encoding
> detection" based on short strings of unknown encoding
> (or to choose from encodings based on a smaller list than "iconv --list")?
It's very unfortunate that the encoding of the filenames is not specified in
the central_directory_file_header in unzip.h. So the best you can do is to
fall back on heuristics, based on these three bits of information:
1) the version_made_by[1] field, which contains the OS on which the zip
file was made.
2) the locale (especially language) of the user who attempts to extract the
zip,
3) the set of filenames in the zip file.
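As a rough sketch of how these three inputs could be gathered, here is some
Python using the standard zipfile module. ZipInfo.create_system holds the
host-OS byte of "version made by", and general purpose flag bit 11 (0x800)
marks names stored as UTF-8. The function name gather_hints and the use of the
LC_ALL / LC_CTYPE / LANG environment variables for the user's locale are only
illustrative choices. zipfile decodes unflagged names as CP437, so the sketch
re-encodes them to get the raw bytes back.

```python
import os
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def gather_hints(zip_source):
    """Return (host OS codes, user locale string or None, raw filename bytes)."""
    with zipfile.ZipFile(zip_source) as zf:
        infos = zf.infolist()
    # Host-OS byte of "version made by", collected over all entries.
    host_systems = sorted({info.create_system for info in infos})
    # zipfile decodes unflagged names as CP437; undo that to get raw bytes.
    raw_names = [
        info.filename.encode("utf-8" if info.flag_bits & UTF8_FLAG else "cp437")
        for info in infos
    ]
    # POSIX-style locale lookup (an assumption); None if nothing is set.
    user_locale = (os.environ.get("LC_ALL") or os.environ.get("LC_CTYPE")
                   or os.environ.get("LANG"))
    return host_systems, user_locale, raw_names
```

Calling gather_hints("archive.zip") would return something like
([3], "fr_FR.UTF-8", [b"...", ...]), depending on the archive and environment.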
Here's how you can use this information to do something meaningful:
1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST
encoding, FS_NTFS and FS_VFAT preferably use Windows encodings, BEOS
uses UTF-8, MAC uses the MAC-* specific encodings, and MAC_OSX uses UTF-8 in
decomposed normal form.
2) Assuming that the language of the person who extracts the zip often matches
the language of the one who created it, you can set up a list of encodings
to try:
Afrikaans UTF-8 ISO-8859-15 ISO-8859-1
Albanian UTF-8 ISO-8859-15 ISO-8859-1
Arabic UTF-8 ISO-8859-6 CP1256
Armenian UTF-8 ARMSCII-8
Basque UTF-8 ISO-8859-15 ISO-8859-1
Breton UTF-8 ISO-8859-15 ISO-8859-1
Bulgarian UTF-8 ISO-8859-5
Byelorussian UTF-8 ISO-8859-5
Catalan UTF-8 ISO-8859-15 ISO-8859-1
Chinese UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
Cornish UTF-8 ISO-8859-15 ISO-8859-1
Croatian UTF-8 ISO-8859-2
Czech UTF-8 ISO-8859-2
Danish UTF-8 ISO-8859-15 ISO-8859-1
Dutch UTF-8 ISO-8859-15 ISO-8859-1
English UTF-8 ISO-8859-15 ISO-8859-1
Esperanto UTF-8 ISO-8859-3
Estonian UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
Faeroese UTF-8 ISO-8859-15 ISO-8859-1
Finnish UTF-8 ISO-8859-15 ISO-8859-1
French UTF-8 ISO-8859-15 ISO-8859-1
Frisian UTF-8 ISO-8859-15 ISO-8859-1
Galician UTF-8 ISO-8859-15 ISO-8859-1
Georgian UTF-8 GEORGIAN-ACADEMY GEORGIAN-PS
German UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-2
Greek UTF-8 ISO-8859-7
Greenlandic UTF-8 ISO-8859-15 ISO-8859-1
Hebrew UTF-8 ISO-8859-8 CP1255
Hungarian UTF-8 ISO-8859-2
Icelandic UTF-8 ISO-8859-10 ISO-8859-15 ISO-8859-1
Inuit UTF-8 ISO-8859-10
Irish UTF-8 ISO-8859-14 ISO-8859-15 ISO-8859-1
Italian UTF-8 ISO-8859-15 ISO-8859-1
Japanese UTF-8 EUC-JP CP932
Kazakh UTF-8 PT154
Korean UTF-8 EUC-KR CP949 JOHAB
Laotian UTF-8 MULELAO-1 CP1133
Latin UTF-8 ISO-8859-15 ISO-8859-1
Latvian UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
Lithuanian UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
Luxemburgish UTF-8 ISO-8859-15 ISO-8859-1
Macedonian UTF-8 ISO-8859-5
Maltese UTF-8 ISO-8859-3
Manx Gaelic UTF-8 ISO-8859-14
Norwegian UTF-8 ISO-8859-15 ISO-8859-1
Polish UTF-8 ISO-8859-2 ISO-8859-13
Portuguese UTF-8 ISO-8859-15 ISO-8859-1
Raeto-Romanic UTF-8 ISO-8859-15 ISO-8859-1
Romanian UTF-8 ISO-8859-16
Russian UTF-8 KOI8-R ISO-8859-5 KOI8-RU
Sami UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
Scottish UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-14
Serbian UTF-8 ISO-8859-5
Slovak UTF-8 ISO-8859-2
Slovenian UTF-8 ISO-8859-2
Sorbian UTF-8 ISO-8859-2
Spanish UTF-8 ISO-8859-15 ISO-8859-1
Swedish UTF-8 ISO-8859-15 ISO-8859-1
Tajik UTF-8 KOI8-T
Thai UTF-8 ISO-8859-11 TIS-620 CP874
Turkish UTF-8 ISO-8859-9
Ukrainian UTF-8 KOI8-U ISO-8859-5
Vietnamese UTF-8 VISCII TCVN CP1258
Welsh UTF-8 ISO-8859-14
3) Look at the set of file names in the zip. If they _all_ happen to be
valid UTF-8, you can assume that UTF-8 is the encoding (because there are
very few meaningful strings which look like UTF-8 but aren't).
Otherwise, proceed similarly with the other candidate encodings.
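The three steps above can be sketched in Python roughly as follows. Several
things here are assumptions rather than facts from unzip.h: the numeric
host-OS codes (check them against your copy of the header), the ISO 639 keys
for the few table rows copied in, the fallback to plain UTF-8 for unknown
languages, and all function names. For FS_NTFS, FS_VFAT and MAC the sketch
simply falls through to the language row, which is a simplification. It also
relies on Python's own codecs, so iconv names that Python does not know (such
as ATARIST) are treated as "does not match".

```python
import re
import unicodedata

# Step 1: hosts with one known encoding (numeric codes are assumed values).
HOST_OS_ENCODING = {
    1: "ISO-8859-1",   # AMIGA
    5: "ATARIST",      # ATARI
    16: "UTF-8",       # BEOS
    19: "UTF-8",       # MAC_OSX (decomposed normal form)
}
MAC_OSX = 19

# Step 2: a few rows of the table, keyed by ISO 639 code (keys are assumed).
LANGUAGE_ENCODINGS = {
    "fr": ["UTF-8", "ISO-8859-15", "ISO-8859-1"],        # French
    "el": ["UTF-8", "ISO-8859-7"],                       # Greek
    "ru": ["UTF-8", "KOI8-R", "ISO-8859-5", "KOI8-RU"],  # Russian
}


def candidate_encodings(host_os, locale_name):
    """Use the host OS if it settles the question, else the language row."""
    if host_os in HOST_OS_ENCODING:
        return [HOST_OS_ENCODING[host_os]]
    lang = re.split(r"[_.@]", locale_name or "", maxsplit=1)[0].lower()
    return list(LANGUAGE_ENCODINGS.get(lang, ["UTF-8"]))


def decodes_as(raw, encoding):
    """True if the raw bytes are valid in the given encoding."""
    try:
        raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


def guess_encoding(raw_names, candidates):
    """Step 3: first candidate under which _all_ names decode, else None."""
    for encoding in candidates:
        if all(decodes_as(raw, encoding) for raw in raw_names):
            return encoding
    return None


def decode_name(raw, encoding, host_os):
    """Decode one name; recompose MAC_OSX names to NFC (a design choice)."""
    name = raw.decode(encoding)
    return unicodedata.normalize("NFC", name) if host_os == MAC_OSX else name
```

Note that single-byte encodings such as ISO-8859-1 accept any byte string, so
in practice the order of the candidate list does most of the work: UTF-8 is
the only strict test, and the first single-byte encoding after it wins.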
Furthermore, for Chinese, you can use techniques based on character
frequencies, such as those described at
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
http://kamares.ucsd.edu/~arobert/hanziData.html
http://www.mandarintools.com/codeguess.html
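To give an idea of what a frequency-based check looks like, here is a toy
sketch. It is not the method of any of the pages above: it decodes the names
under each candidate encoding and counts how many characters fall into a tiny,
hand-picked set of very common hanzi. A real detector would use a proper
frequency table; COMMON_HANZI, the function names and the default candidate
pair are all assumptions made for illustration.

```python
# A handful of very common characters; a stand-in for a real frequency table.
COMMON_HANZI = set("的一是不了人我在有他")


def score(raw, encoding):
    """Count common hanzi in the decoded name, or -1 if it does not decode."""
    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return -1
    return sum(ch in COMMON_HANZI for ch in text)


def total_score(raw_names, encoding):
    """Sum over all names; one undecodable name disqualifies the encoding."""
    scores = [score(raw, encoding) for raw in raw_names]
    return -1 if -1 in scores else sum(scores)


def best_chinese_encoding(raw_names, candidates=("gb18030", "big5")):
    """Return the candidate whose decoding contains the most common hanzi."""
    return max(candidates, key=lambda enc: total_score(raw_names, enc))
```

With only a few short filenames the counts are small, so this kind of scoring
is best used as a tie-breaker after the validity check, not on its own.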
Bruno
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/