Hi,

I'd like to thank everyone for their useful input on this issue. Based on your suggestions, I am opening a bug report for file-roller (fileroller.sourceforge.net/) and notifying the authors of Info-Zip (http://www.info-zip.org/pub/infozip/) of the issue. If you use KDE and the "ark" archiver, please report it there too, as the problem exists there as well. The same goes for Windows users of 7-Zip (http://www.7-zip.org/).

If you have something more to add, please do so at the individual bugzilla pages.

File-roller bug report: http://bugzilla.gnome.org/show_bug.cgi?id=306403
7-Zip bug report: https://sourceforge.net/tracker/index.php?func=detail&aid=1214471&group_id=14481&atid=114481

My main concern was with GUI ZIP archivers, which should work no matter what the compressed file is (the "It just works" philosophy, GNOME :)), in contrast to command-line archivers. I just realised that at least "file-roller" is a front-end to Info-Zip (zip/unzip/etc), so work needs to be done there as well. The same is probably true of Ark.

As noted in http://mail.nl.linux.org/linux-utf8/2005-06/msg00006.html, "unzip" has a bug: it tries to force a character conversion from CP437 to latin-1, losing the encoding information for any program that calls it. Therefore, "unzip" should be fixed as well, so that any program that calls it can retrieve at least the original filename and proceed with an "intelligent" conversion to UTF-8. Actually, the fix might need to be done in Info-Zip altogether, since "unzip" needs a way to extract the file and place it on the filesystem. There is no way that file-roller can do something like "unzip [EMAIL PROTECTED]@#$.doc --saveas=test.doc", since "unzip" cannot extract a file to a different filename.

The thread started at http://mail.nl.linux.org/linux-utf8/2005-06/#00000 and from there one can view the whole discussion.

Indeed, in the general case it is not possible to detect which 8-bit encoding a string is in. The byte values used for the alphabet might give a hint. For example, the ISO-8859-x encodings (x>1) keep their letters at 160 and above and leave 128-159 to control codes, while the CPxxx encodings (such as CP737) also place letters in the 128-159 range.

The ZIP program can find out the user's language from the locale (suppose it's Greek). If the filename is not valid UTF-8, it's probably in a Greek 8-bit encoding. There are two main options here, ISO-8859-7 and CP737.
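
To make the Greek case concrete, here is a minimal Python sketch of that check: accept the name if it is valid UTF-8, otherwise try the two Greek 8-bit encodings in turn. The function name decode_filename is made up for this example, and "iso8859_7" and "cp737" are the names Python's standard codecs use for ISO-8859-7 and CP737. It is only an illustration of the idea, not how any existing archiver works.

```python
def decode_filename(raw, candidates=("iso8859_7", "cp737")):
    """Return (text, encoding) for the first encoding that decodes raw strictly.

    UTF-8 is always tried first; `candidates` are the 8-bit fallbacks
    (here the two Greek options, an assumption for a Greek locale).
    """
    for encoding in ("utf-8",) + tuple(candidates):
        try:
            # Strict decoding raises on bytes the encoding does not define.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fits")


if __name__ == "__main__":
    # Simulate a filename stored by a DOS-era tool in CP737.
    stored = "Αποχαιρετιστήριος λόγος μαθητή.doc".encode("cp737")
    print(decode_filename(stored))
```

Note that a strict decode can only rule an encoding out. A CP737 name that happens to avoid the bytes ISO-8859-7 leaves undefined would be accepted as ISO-8859-7 and come out as the wrong text, so the order of the candidates matters.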

If you try iconv(1), it will only work for the correct encoding, while it will fail for the other (because the letters sit at different byte positions). Specifically:

  > zipnote apoxairetisthrio-logos-mathith.zip | iconv -f CP737 -t utf-8
  @ Αποχαιρετιστήριος λόγος μαθητή.doc
  @ (comment above this line)
  @ (zip file comment below this line)

  > zipnote apoxairetisthrio-logos-mathith.zip | iconv -f ISO-8859-7 -t utf-8
  @ §¦iconv: illegal input sequence at position 5

  > zipnote apoxairetisthrio-logos-mathith.zip | iconv -f CP1253 -t utf-8
  @ €§¦®iconv: illegal input sequence at position 6

Therefore, as Bruno described in http://mail.nl.linux.org/linux-utf8/2005-06/msg00009.html, the ZIP application should check whether the filename is valid UTF-8, and if not, it should do something about it. It should try to convert the name to UTF-8 using heuristics (see Bruno's e-mail), or else, as a last resort, replace the unknown characters with the Unicode replacement character "�" (thanks Egmont, http://www.fileformat.info/info/unicode/char/fffd/index.htm).

Simos

On Friday, 03 Jun 2005, at 14:08, Bruno Haible wrote:
> Simos Xenitellis wrote:
> > Is there a library or sample program that can do such a "encoding
> > detection" based on short strings of unknown encoding
> > (or to choose from encodings based on a smaller list than "iconv --list")?
>
> It's very unfortunate the encoding of the filenames is not specified in the
> central_directory_file_header in unzip.h. So the best you can do is to
> fall back on heuristics, based on these three bits of information:
>
> 1) the version_made_by[1] field, which contains the OS on which the zip
>    file was made.
> 2) the locale (especially language) of the user who attempts to extract the
>    zip,
> 3) the set of filenames in the zip file.
>
> Here's how you can use this information to do something meaningful:
>
> 1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST
>    encoding, FS_NTFS and FS_VFAT use preferably Windows encodings, BEOS
>    uses UTF-8, MAC uses the MAC-* specific encodings, MAC_OSX uses UTF-8 in
>    decomposed normal form.
>
> 2) Assuming that the language of the person who extracts the zip often matches
>    the language of the one who created it, you can set up a list of encodings
>    to try:
>
>    Afrikaans          UTF-8 ISO-8859-15 ISO-8859-1
>    Albanian           UTF-8 ISO-8859-15 ISO-8859-1
>    Arabic             UTF-8 ISO-8859-6 CP1256
>    Armenian           UTF-8 ARMSCII-8
>    Basque             UTF-8 ISO-8859-15 ISO-8859-1
>    Breton             UTF-8 ISO-8859-15 ISO-8859-1
>    Bulgarian          UTF-8 ISO-8859-5
>    Byelorussian       UTF-8 ISO-8859-5
>    Catalan            UTF-8 ISO-8859-15 ISO-8859-1
>    Chinese            UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
>    Cornish            UTF-8 ISO-8859-15 ISO-8859-1
>    Croatian           UTF-8 ISO-8859-2
>    Czech              UTF-8 ISO-8859-2
>    Danish             UTF-8 ISO-8859-15 ISO-8859-1
>    Dutch              UTF-8 ISO-8859-15 ISO-8859-1
>    English            UTF-8 ISO-8859-15 ISO-8859-1
>    Esperanto          UTF-8 ISO-8859-3
>    Estonian           UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Faeroese           UTF-8 ISO-8859-15 ISO-8859-1
>    Finnish            UTF-8 ISO-8859-15 ISO-8859-1
>    French             UTF-8 ISO-8859-15 ISO-8859-1
>    Frisian            UTF-8 ISO-8859-15 ISO-8859-1
>    Galician           UTF-8 ISO-8859-15 ISO-8859-1
>    Georgian           UTF-8 GEORGIAN-ACADEMY GEORGIAN-PS
>    German             UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-2
>    Greek              UTF-8 ISO-8859-7
>    Greenlandic        UTF-8 ISO-8859-15 ISO-8859-1
>    Hebrew             UTF-8 ISO-8859-8 CP1255
>    Hungarian          UTF-8 ISO-8859-2
>    Icelandic          UTF-8 ISO-8859-10 ISO-8859-15 ISO-8859-1
>    Inuit              UTF-8 ISO-8859-10
>    Irish              UTF-8 ISO-8859-14 ISO-8859-15 ISO-8859-1
>    Italian            UTF-8 ISO-8859-15 ISO-8859-1
>    Japanese           UTF-8 EUC-JP CP932
>    Kazakh             UTF-8 PT154
>    Korean             UTF-8 EUC-KR CP949 JOHAB
>    Laotian            UTF-8 MULELAO-1 CP1133
>    Latin              UTF-8 ISO-8859-15 ISO-8859-1
>    Latvian            UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Lithuanian         UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Luxemburgish       UTF-8 ISO-8859-15 ISO-8859-1
>    Macedonian         UTF-8 ISO-8859-5
>    Maltese            UTF-8 ISO-8859-3
>    Manx Gaelic        UTF-8 ISO-8859-14
>    Norwegian          UTF-8 ISO-8859-15 ISO-8859-1
>    Polish             UTF-8 ISO-8859-2 ISO-8859-13
>    Portuguese         UTF-8 ISO-8859-15 ISO-8859-1
>    Raeto-Romanic      UTF-8 ISO-8859-15 ISO-8859-1
>    Romanian           UTF-8 ISO-8859-16
>    Russian            UTF-8 KOI8-R ISO-8859-5 KOI8-RU
>    Sami               UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Scottish           UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-14
>    Serbian            UTF-8 ISO-8859-5
>    Slovak             UTF-8 ISO-8859-2
>    Slovenian          UTF-8 ISO-8859-2
>    Sorbian            UTF-8 ISO-8859-2
>    Spanish            UTF-8 ISO-8859-15 ISO-8859-1
>    Swedish languages  UTF-8 ISO-8859-15 ISO-8859-1
>    Tajik              UTF-8 KOI8-T
>    Thai               UTF-8 ISO-8859-11 TIS-620 CP874
>    Turkish            UTF-8 ISO-8859-9
>    Ukrainian          UTF-8 KOI8-U ISO-8859-5
>    Vietnamese         UTF-8 VISCII TCVN CP1258
>    Welsh              UTF-8 ISO-8859-14
>
> 3) Look at the set of file names in the zip. If they _all_ happen to be
>    in UTF-8, you can assume that's it (because there are very few
>    meaningful strings which look like UTF-8 but aren't).
>    Then go ahead similarly for the other encodings.
>
> Furthermore, for Chinese, you can use frequency-of-characters based
> techniques such as
>   http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>   http://kamares.ucsd.edu/~arobert/hanziData.html
>   http://www.mandarintools.com/codeguess.html
>
> Bruno
>
>
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
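
For what it's worth, steps 2) and 3) of Bruno's heuristic, together with the U+FFFD fallback, could look roughly like the Python sketch below. The function names, the two-letter language keys and the EXTRA table are assumptions made up for this example. Only a few rows of the table are copied, KOI8-RU is left out because Python ships no codec for it, and CP737 is added for Greek because of the iconv test above (it is not in Bruno's list). Step 1) (version_made_by) is not covered.

```python
# Excerpt of the per-language lists above, using Python codec names.
CANDIDATES = {
    "el": ["utf-8", "iso8859_7"],
    "he": ["utf-8", "iso8859_8", "cp1255"],
    "ru": ["utf-8", "koi8_r", "iso8859_5"],  # KOI8-RU omitted: no Python codec
}

# Assumption: DOS code pages appended by me, not part of the quoted table.
EXTRA = {"el": ["cp737"]}


def guess_archive_encoding(raw_names, language):
    """Return the first candidate that decodes *all* names, or None."""
    candidates = CANDIDATES.get(language, ["utf-8"]) + EXTRA.get(language, [])
    for encoding in candidates:
        try:
            for raw in raw_names:
                raw.decode(encoding)  # strict: raises on undefined bytes
        except UnicodeDecodeError:
            continue
        return encoding
    return None


def decode_names(raw_names, language):
    """Decode every name; unknown bytes become U+FFFD as a last resort."""
    encoding = guess_archive_encoding(raw_names, language)
    if encoding is None:
        return [raw.decode("utf-8", errors="replace") for raw in raw_names]
    return [raw.decode(encoding) for raw in raw_names]


if __name__ == "__main__":
    names = [s.encode("cp737") for s in ("Αποχαιρετιστήριος λόγος μαθητή.doc", "readme.txt")]
    print(guess_archive_encoding(names, "el"), decode_names(names, "el"))
```

Checking the whole set of names at once, rather than each name separately, is what keeps a pure-ASCII name such as "readme.txt" from settling the question too early.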
