Re: How to detect the encoding of a string?

Abel Cheung Fri, 03 Jun 2005 13:20:42 -0700

On 6/3/05, Bruno Haible <[EMAIL PROTECTED]> wrote:
> 2) Assuming that the language of the person who extracts the zip often matches
>    the language of the one who created it, you can set up a list of encodings
>    to try:
[..... snip long list of encodings......]


gedit already does something similar: it lets users configure a
list of encodings to try. The difference is that gedit is detecting
the encoding of text file contents, while file-roller needs to
detect the encoding of file names.

Besides, the default encoding list should probably depend on the
region as well as the language, since regions may prefer different
encodings even when they share the same language code. For example,
Ukrainians may want to try KOI8-U before KOI8-R, while Russians
might want the opposite order.
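
To make the ordering point concrete, here is a minimal Python sketch.
The table contents, the locale keys and the function names are my own
assumptions for illustration; they are not taken from gedit or
file-roller. Note that KOI8-R and KOI8-U accept almost any byte
string, so whichever comes first in the list effectively wins.

```python
# Hypothetical defaults: the locale keys and the orderings below are
# illustrative assumptions, not lists taken from gedit or file-roller.
DEFAULT_ENCODINGS = {
    "uk_UA": ["utf-8", "koi8_u", "cp1251", "koi8_r"],
    "ru_RU": ["utf-8", "koi8_r", "cp1251", "koi8_u"],
}
FALLBACK_ENCODINGS = ["utf-8", "latin-1"]


def candidate_encodings(locale_name):
    """Return the ordered encodings to try for a locale such as 'uk_UA'."""
    return DEFAULT_ENCODINGS.get(locale_name, FALLBACK_ENCODINGS)


def decode_filename(raw, locale_name):
    """Decode raw filename bytes with the first encoding that accepts them.

    Returns (text, encoding), or (None, None) if every candidate fails.
    """
    for encoding in candidate_encodings(locale_name):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None
```

With these tables, a name containing the Ukrainian letter "ї" encoded
in KOI8-U decodes correctly under "uk_UA", but under "ru_RU" the same
bytes are accepted by KOI8-R first and come out with a wrong
character in place of that letter.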


> 
> 3) Look at the set of file names in the zip. If they _all_ happen to be
>    in UTF-8, you can assume that's it (because there are very few
>    meaningful strings which look like UTF-8 but aren't).

Yes, that's rare, but it has happened in the real world, especially
with multibyte characters. Here is an example:

http://qa.mandrakesoft.com/show_bug.cgi?id=3935
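
For reference, the "all names are UTF-8" test could look roughly like
the Python sketch below. The function name is hypothetical, and
requiring at least one non-ASCII byte is my own assumption: pure
ASCII names are valid in almost every encoding, so they say nothing
either way.

```python
def names_look_like_utf8(names):
    """Return True if every name is valid UTF-8 and at least one name
    contains a non-ASCII byte.

    `names` is an iterable of raw filename bytes as stored in the
    archive.
    """
    saw_non_ascii = False
    for raw in names:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # One invalid name is enough to reject UTF-8 for the set.
            return False
        if any(byte >= 0x80 for byte in raw):
            saw_non_ascii = True
    return saw_non_ascii
```

The false positives come from byte sequences that are valid in both
encodings. For instance, the two bytes 0xC2 0xB7 are valid UTF-8 (a
middle dot) and are also a valid two-byte GBK sequence, so a short
multibyte name can pass this check by accident.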


>    Then go ahead similarly for the other encodings.
> 
>    Furthermore, for Chinese, you can use frequency-of-characters based
>    techniques such as
>      http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>      http://kamares.ucsd.edu/~arobert/hanziData.html
>      http://www.mandarintools.com/codeguess.html
> 
> Bruno
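
A toy Python illustration of the frequency idea, for what it's worth.
This is not the algorithm used by any of the tools linked above: the
character sample, the scoring rule and the function names are all
assumptions of mine, and a real detector would use frequency tables
built from a large corpus.

```python
# Tiny, hypothetical sample of very common hanzi, for illustration only.
COMMON_HANZI = set("的一是不了人我在有他这中大来上文件")


def score(raw, encoding):
    """Fraction of decoded non-ASCII characters found in COMMON_HANZI.

    Returns 0.0 if the bytes cannot be decoded with `encoding` or if
    the decoded text has no non-ASCII characters.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return 0.0
    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return 0.0
    hits = sum(1 for ch in non_ascii if ch in COMMON_HANZI)
    return hits / len(non_ascii)


def guess_encoding(raw, candidates=("gbk", "big5")):
    """Pick the candidate with the highest score (first one on a tie)."""
    return max(candidates, key=lambda enc: score(raw, enc))
```

The wrong encoding usually either fails to decode or produces rare
characters, so its score stays low. File names are short, though, so
a guess based on a handful of characters is much less reliable than
one made on a whole text file.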
> 
> 
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
> 
> 


-- 
Abel Cheung   (GPG Key: 0xC67186FF)
Key fingerprint: 671C C7AE EFB5 110C D6D1  41EE 4152 E1F1 C671 86FF
--------------------------------------------------------------------
* GNOME Hong Kong - http://www.gnome.hk/
* Opensource Application Knowledge Assoc. - http://oaka.org/

