On 6/3/05, Bruno Haible <[EMAIL PROTECTED]> wrote:
> 2) Assuming that the language of the person who extracts the zip often matches
> the language of the one who created it, you can set up a list of encodings
> to try: [..... snip long list of encodings......]
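
To make the quoted suggestion concrete, here is a minimal Python sketch of "a list of encodings to try". It is not what file-roller or gedit actually does; the function name `guess_filename_encoding` and the candidate lists are hypothetical, chosen only for illustration. It returns the first candidate under which every raw file name decodes without error.

```python
from typing import Iterable, Optional


def guess_filename_encoding(names: Iterable[bytes],
                            candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes ALL names, else None.

    Hypothetical helper: `names` are raw file names as stored in the archive,
    `candidates` is an ordered list of Python codec names to try.
    """
    names = list(names)  # allow several passes over the same names
    for encoding in candidates:
        try:
            for raw in names:
                raw.decode(encoding)  # strict decoding: raises on invalid bytes
        except UnicodeDecodeError:
            continue  # at least one name is invalid here, try the next one
        return encoding
    return None


if __name__ == "__main__":
    # A Cyrillic file name stored as KOI8-R bytes (not valid UTF-8).
    raw_names = [b"\xf0\xd2\xc9\xd7\xc5\xd4.txt"]
    print(guess_filename_encoding(raw_names, ["utf-8", "koi8_r", "koi8_u"]))
    print(guess_filename_encoding(raw_names, ["utf-8", "koi8_u", "koi8_r"]))
```

Note that the two calls give different answers for the same bytes: single-byte encodings such as KOI8-R and KOI8-U accept almost any byte sequence, so the result depends on the order of the list. That is why the default order matters.
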
gedit already does something similar: it lets users configure a list of
encodings to test against. The difference is that gedit checks the encoding
of text file contents, while file-roller checks the encoding of file names.

Besides, for the default encoding list, it might be better to decide based on
region as well as language, since regions that share the same language code
may still prefer different encodings. For example, Ukrainians may want to try
KOI8-U before KOI8-R, while Russians might want the opposite order.

> 3) Look at the set of file names in the zip. If they _all_ happen to be
> in UTF-8, you can assume that's it (because there are very few
> meaningful strings which look like UTF-8 but aren't).

Yes, that's rare, though it has really happened in the real world,
especially with multibyte characters. Here is a sample:
http://qa.mandrakesoft.com/show_bug.cgi?id=3935

> Then go ahead similarly for the other encodings.
>
> Furthermore, for Chinese, you can use frequency-of-characters based
> techniques such as
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
> http://kamares.ucsd.edu/~arobert/hanziData.html
> http://www.mandarintools.com/codeguess.html
>
> Bruno

--
Abel Cheung (GPG Key: 0xC67186FF)
Key fingerprint: 671C C7AE EFB5 110C D6D1 41EE 4152 E1F1 C671 86FF
--------------------------------------------------------------------
* GNOME Hong Kong - http://www.gnome.hk/
* Opensource Application Knowledge Assoc. - http://oaka.org/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
