On Tue, 23 Oct 2001, D. Dale Gulledge wrote:
> Is there a reliable tool for determining the encoding of a file?
No, but here are a few ideas for whoever wants to write a good one
(probably to be contributed to the GNU "file" utility):
Depending on the amount of effort, you can distinguish different
encodings quite well as long as the text is long enough for the usual
cryptanalytic techniques for breaking substitution ciphers to work
(which means usually >500 characters):
- UTF-8 follows strict rules, and every other encoding (except for the
UTF-8 subset ASCII, which usually does not have to be distinguished)
will contain either malformed UTF-8 sequences (when it is an
8-bit encoding) or ISO 2022 escape sequences (when it is a CJK
encoding), both of which make it quite unlikely that a
non-UTF-8 encoding is mistaken for UTF-8.
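That first test is easy to sketch; in Python (the helper name here is my own invention, not anything from "file"), a strict decode already performs the complete UTF-8 well-formedness check:

```python
def looks_like_utf8(data: bytes) -> bool:
    # A strict decode rejects every malformed sequence: overlong forms,
    # stray continuation bytes, truncated multi-byte characters, etc.
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Since an 8-bit text of any realistic length almost certainly contains some byte >= 0x80 that does not form a valid UTF-8 sequence, a single pass like this is already a strong discriminator.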
- The EUC encodings can similarly be ruled out by byte sequences
that are not allowed in them, such as unpaired GR bytes.
- ISO 8859 files should be free of C1 and most C0 codes (except
for the usual LF/TAB).
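The ISO 8859 plausibility test is equally simple to sketch (again a hypothetical helper, assuming TAB/LF/CR/FF are the C0 codes one wants to tolerate):

```python
def plausible_iso8859(data: bytes) -> bool:
    # Plain ISO 8859 text should contain no C1 controls (0x80-0x9F)
    # and, apart from TAB/LF/FF/CR, no C0 controls either.
    allowed_c0 = {0x09, 0x0A, 0x0C, 0x0D}
    return not any(
        (b < 0x20 and b not in allowed_c0) or 0x80 <= b <= 0x9F
        for b in data
    )
```

As a bonus, this test also flags the Windows code pages (CP1252 etc.), which put printable characters into the C1 range that ISO 8859 reserves for control codes.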
- Any file should be free of unused code positions.
- You can do a bit more with character and tuple frequency
analysis. You need for various languages (English, German,
French, C, Lisp) and their transliterations a library of
frequency tables for the various UCS characters/pairs,
and then you try all Something->UCS conversions
until you find the best match of the resulting histogram
with one in the library (read up on "index of coincidence"
[Friedman, ~1920] in introductory cryptanalysis textbooks
such as Stinson).
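The core statistic behind that comparison is easy to state; a minimal sketch of Friedman's index of coincidence (helper name assumed, not from any standard library):

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    # Probability that two characters drawn at random (without
    # replacement) from the text are identical.
    counts = Counter(text)
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

For the detector one would compute per-character (and per-pair) histograms of each candidate Something->UCS conversion and score them against the stored language tables; the conversion whose histogram matches best wins.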
- Add to that rules for which languages are likely to be encoded
in which way.
- Add to that a library of clue patterns from standardized marker
formats such as MIME headers, .htaccess files, Emacs headers,
the locale, etc.
- Set up a rule-based resolution algorithm that merges the results
of these tests according to rule priorities. For instance, the
presence of malformed UTF-8 sequences should carry more
weight than fragments of a MIME header that claim UTF-8.
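A very crude sketch of such a merge step (the function and the weights are purely illustrative; a real expert system would use prioritized rules rather than a flat sum):

```python
def merge_verdicts(evidence):
    # evidence: (encoding, weight) pairs emitted by the individual tests.
    # Hard evidence (e.g. malformed UTF-8 found) contributes a large
    # negative weight against a candidate; soft hints (a MIME header
    # claiming a charset) contribute a small positive one.
    scores = {}
    for encoding, weight in evidence:
        scores[encoding] = scores.get(encoding, 0.0) + weight
    return max(scores, key=scores.get)
```

Here a MIME header claiming UTF-8 (+2) is correctly overruled by the malformed-sequence test (-10), so the Latin-1 candidate wins.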
Make all that configurable for the end users, as they are
likely to have further a priori knowledge of which encodings
are to be expected.
These beasts used to be called "expert systems" when I went to
school and "A.I." was a research field, not a movie title ...
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/