SUMMARY: Zip/Unzip and encoding problem (Was: Re: How to detect the encoding of a string?)

Simos Xenitellis Fri, 03 Jun 2005 12:27:33 -0700

Hi,
I'd like to thank everyone for their useful input on this issue. Based
on your suggestions I am opening a bug report for file-roller
(fileroller.sourceforge.net/) and notifying the authors of Info-Zip
(http://www.info-zip.org/pub/infozip/) of the issue. If you are from KDE
and use the "ark" archiver, please report it there, as the problem exists
there as well. The same goes for 7-Zip Windows users (http://www.7-zip.org/).


If you have something more to add, please do so at the individual
bugzilla pages.

File-roller bug report:
http://bugzilla.gnome.org/show_bug.cgi?id=306403

7-Zip bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=1214471&group_id=14481&atid=114481

My main concern was with GUI ZIP archivers, which should work no matter
what the compressed file is (the "It just works" philosophy, GNOME :)), in
contrast to command-line archivers. I just realised that at least
"file-roller" is a front-end to Info-Zip (zip/unzip/etc), so work is
needed there as well. Probably the same holds for Ark.

As noted in http://mail.nl.linux.org/linux-utf8/2005-06/msg00006.html
"unzip" has a bug: it tries to force a character conversion from CP437
to Latin-1, losing the encoding information for any program that calls
it. Therefore, "unzip" should be fixed as well, so that any program that
calls it can retrieve at least the original filename and proceed with
an "intelligent" conversion to UTF-8. Actually, the fix might need to be
done in Info-Zip altogether, since "unzip" needs a way to extract and
place the file on the filesystem. There is no way that file-roller can
do something like "unzip [EMAIL PROTECTED]@#$.doc --saveas=test.doc", since "unzip"
cannot extract a file to a different filename.
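
For a front-end that reads the archive itself instead of calling "unzip",
extracting under a different name could look roughly like the sketch below.
It assumes Python's standard zipfile module, which decodes names that lack
the UTF-8 flag as CP437; since CP437 assigns a character to every byte
value, encoding back to CP437 recovers the original bytes. The helper names
recover_name and extract_as are made up for illustration.

```python
import shutil
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def recover_name(info, legacy_encoding):
    """Return the member name of a ZipInfo, re-decoded with legacy_encoding.

    zipfile decodes names without the UTF-8 flag as CP437. CP437 maps all
    256 byte values, so encoding back to CP437 yields the original bytes.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded from UTF-8
    raw = info.filename.encode("cp437")
    return raw.decode(legacy_encoding)


def extract_as(zip_path, info, target_path):
    """Write the contents of one member to target_path (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(info) as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```

Each entry of ZipFile.infolist() can be passed to recover_name() with a
guessed encoding (for example "cp737"), and the result used as the target
name for extract_as().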

The thread started at http://mail.nl.linux.org/linux-utf8/2005-06/#00000
and from there one can view the whole discussion.

Indeed, in the general case it is not possible to detect which 8-bit
encoding a string has. The byte values used for the alphabet might give a
hint: the ISO-8859-x encodings (x>1) reserve 128-159 for control
characters and keep their letters in 160-255, while the CPxxx code pages
(such as CP737) also place letters in 128-159.
The ZIP program can find out the user's language from the locale (suppose
it's Greek). If the filename is not valid UTF-8, it's probably in a Greek
8-bit encoding. There are two main options here, ISO-8859-7 and CP737.
If you try iconv(1), it will usually work only for the correct encoding
and fail for the other (because the letters sit at different positions).

Specifically:
> zipnote apoxairetisthrio-logos-mathith.zip | iconv -f CP737 -t utf-8
@ Αποχαιρετιστήριος λόγος μαθητή.doc
@ (comment above this line)
@ (zip file comment below this line)

> zipnote apoxairetisthrio-logos-mathith.zip | iconv -f ISO-8859-7 -t utf-8
@ €§¦iconv: illegal input sequence at position 5

> zipnote apoxairetisthrio-logos-mathith.zip | iconv -f CP1253 -t utf-8
@ €§¦®iconv: illegal input sequence at position 6
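
The same experiment can be repeated without iconv. A minimal sketch in
Python, assuming the standard codecs cp737, iso8859_7 and cp1253 behave
like their iconv counterparts for these bytes; the function name
working_encodings is made up for illustration:

```python
def working_encodings(raw, candidates):
    """Return the candidate encodings in which raw decodes without error."""
    ok = []
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        ok.append(encoding)
    return ok


# The filename from the zipnote output, as the CP737 bytes stored in the archive.
raw_name = "Αποχαιρετιστήριος λόγος μαθητή.doc".encode("cp737")
print(working_encodings(raw_name, ["utf-8", "iso8859_7", "cp1253", "cp737"]))
```

A successful decode is only a necessary condition: an encoding that
defines all 256 byte values (CP737 does) never fails, so it cannot be
ruled out this way.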

Therefore, as Bruno described in
http://mail.nl.linux.org/linux-utf8/2005-06/msg00009.html
the ZIP application should check whether the filename is valid UTF-8 and,
if not, do something about it: try to convert it to UTF-8 with heuristics
(see Bruno's e-mail) or, as a last resort, replace the unknown
characters with the Unicode Replacement Character "�" (thanks Egmont,
http://www.fileformat.info/info/unicode/char/fffd/index.htm).
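
The order of fallbacks described here can be sketched as follows. This is
only an illustration in Python; the function name to_unicode and its
fallbacks parameter are assumptions, and a real archiver would choose the
fallback list with the heuristics from Bruno's e-mail.

```python
def to_unicode(raw, fallbacks=()):
    """Decode a raw filename: strict UTF-8 first, then each fallback encoding,
    and as a last resort UTF-8 with U+FFFD in place of undecodable bytes."""
    for encoding in ("utf-8", *fallbacks):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    return raw.decode("utf-8", errors="replace")
```

For example, to_unicode(name, ["cp737", "iso8859_7"]) would try UTF-8,
then CP737, then ISO-8859-7, before resorting to replacement characters.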

Simos

On Friday, 03 Jun 2005, at 14:08, Bruno Haible wrote:
> Simos Xenitellis wrote:
> > Is there a library or sample program that can do such an "encoding
> > detection" based on short strings of unknown encoding
> > (or to choose from encodings based on a smaller list than "iconv --list")?
> 
> It's very unfortunate that the encoding of the filenames is not specified in the
> central_directory_file_header in unzip.h. So the best you can do is to
> fall back on heuristics, based on these three bits of information:
> 
>  1) the version_made_by[1] field, which contains the OS on which the zip
>     file was made.
>  2) the locale (especially language) of the user who attempts to extract the
>     zip,
>  3) the set of filenames in the zip file.
> 
> Here's how you can use this information to do something meaningful:
> 
> 1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST
>    encoding, FS_NTFS and FS_VFAT use preferably Windows encodings, BEOS
>    uses UTF-8, MAC uses the MAC-* specific encodings, MAC_OSX uses UTF-8 in
>    decomposed normal form.
> 
> 2) Assuming that the language of the person who extracts the zip often matches
>    the language of the one who created it, you can set up a list of encodings
>    to try:
> 
>    Afrikaans          UTF-8 ISO-8859-15 ISO-8859-1
>    Albanian           UTF-8 ISO-8859-15 ISO-8859-1
>    Arabic             UTF-8 ISO-8859-6 CP1256
>    Armenian           UTF-8 ARMSCII-8
>    Basque             UTF-8 ISO-8859-15 ISO-8859-1
>    Breton             UTF-8 ISO-8859-15 ISO-8859-1
>    Bulgarian          UTF-8 ISO-8859-5
>    Byelorussian       UTF-8 ISO-8859-5
>    Catalan            UTF-8 ISO-8859-15 ISO-8859-1
>    Chinese            UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
>    Cornish            UTF-8 ISO-8859-15 ISO-8859-1
>    Croatian           UTF-8 ISO-8859-2
>    Czech              UTF-8 ISO-8859-2
>    Danish             UTF-8 ISO-8859-15 ISO-8859-1
>    Dutch              UTF-8 ISO-8859-15 ISO-8859-1
>    English            UTF-8 ISO-8859-15 ISO-8859-1
>    Esperanto          UTF-8 ISO-8859-3
>    Estonian           UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Faeroese           UTF-8 ISO-8859-15 ISO-8859-1
>    Finnish            UTF-8 ISO-8859-15 ISO-8859-1
>    French             UTF-8 ISO-8859-15 ISO-8859-1
>    Frisian            UTF-8 ISO-8859-15 ISO-8859-1
>    Galician           UTF-8 ISO-8859-15 ISO-8859-1
>    Georgian           UTF-8 GEORGIAN-ACADEMY GEORGIAN-PS
>    German             UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-2
>    Greek              UTF-8 ISO-8859-7
>    Greenlandic        UTF-8 ISO-8859-15 ISO-8859-1
>    Hebrew             UTF-8 ISO-8859-8 CP1255
>    Hungarian          UTF-8 ISO-8859-2
>    Icelandic          UTF-8 ISO-8859-10 ISO-8859-15 ISO-8859-1
>    Inuit              UTF-8 ISO-8859-10
>    Irish              UTF-8 ISO-8859-14 ISO-8859-15 ISO-8859-1
>    Italian            UTF-8 ISO-8859-15 ISO-8859-1
>    Japanese           UTF-8 EUC-JP CP932
>    Kazakh             UTF-8 PT154
>    Korean             UTF-8 EUC-KR CP949 JOHAB
>    Laotian            UTF-8 MULELAO-1 CP1133
>    Latin              UTF-8 ISO-8859-15 ISO-8859-1
>    Latvian            UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Lithuanian         UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Luxemburgish       UTF-8 ISO-8859-15 ISO-8859-1
>    Macedonian         UTF-8 ISO-8859-5
>    Maltese            UTF-8 ISO-8859-3
>    Manx Gaelic        UTF-8 ISO-8859-14
>    Norwegian          UTF-8 ISO-8859-15 ISO-8859-1
>    Polish             UTF-8 ISO-8859-2 ISO-8859-13
>    Portuguese         UTF-8 ISO-8859-15 ISO-8859-1
>    Raeto-Romanic      UTF-8 ISO-8859-15 ISO-8859-1
>    Romanian           UTF-8 ISO-8859-16
>    Russian            UTF-8 KOI8-R ISO-8859-5 KOI8-RU
>    Sami               UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
>    Scottish           UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-14
>    Serbian            UTF-8 ISO-8859-5
>    Slovak             UTF-8 ISO-8859-2
>    Slovenian          UTF-8 ISO-8859-2
>    Sorbian            UTF-8 ISO-8859-2
>    Spanish            UTF-8 ISO-8859-15 ISO-8859-1
>    Swedish languages  UTF-8 ISO-8859-15 ISO-8859-1
>    Tajik              UTF-8 KOI8-T
>    Thai               UTF-8 ISO-8859-11 TIS-620 CP874
>    Turkish            UTF-8 ISO-8859-9
>    Ukrainian          UTF-8 KOI8-U ISO-8859-5
>    Vietnamese         UTF-8 VISCII TCVN CP1258
>    Welsh              UTF-8 ISO-8859-14
> 
> 3) Look at the set of file names in the zip. If they _all_ happen to be
>    in UTF-8, you can assume that's it (because there are very few
>    meaningful strings which look like UTF-8 but aren't).
>    Then go ahead similarly for the other encodings.
> 
>    Furthermore, for Chinese, you can use frequency-of-characters based
>    techniques such as
>      http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>      http://kamares.ucsd.edu/~arobert/hanziData.html
>      http://www.mandarintools.com/codeguess.html
> 
> Bruno
> 
> 
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
> 
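
Steps 2 and 3 of Bruno's heuristic could be combined roughly as below.
This is a sketch, not his implementation: the table is a three-row excerpt
of his list rewritten with Python codec names, and the function name
guess_encoding is hypothetical.

```python
# Excerpt of the language table, using Python codec names.
CANDIDATES = {
    "Greek": ["utf-8", "iso8859_7"],
    "Czech": ["utf-8", "iso8859_2"],
    "Japanese": ["utf-8", "euc_jp", "cp932"],
}


def guess_encoding(raw_names, language):
    """Return the first candidate for language in which ALL names decode,
    or None if no candidate fits every name."""
    for encoding in CANDIDATES.get(language, ["utf-8"]):
        try:
            for raw in raw_names:
                raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None
```

With this excerpt, a Greek name stored as CP737 bytes makes
guess_encoding() return None, because CP737 is not in the Greek row.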


