On Tue, 23 Oct 2001, D. Dale Gulledge wrote:
> Is there a reliable tool for determining the encoding of a file?
No, but here are a few ideas for whoever wants to write a good one
(probably to be contributed to the GNU "file" utility):
Depending on the amount of effort, you can distinguish different
encodings quite well as long as the text is long enough for the usual
cryptanalytic techniques for breaking substitution ciphers to work
(which means usually >500 characters):
- UTF-8 follows strict rules, and every other encoding (except for the
UTF-8 subset ASCII, which usually does not have to be distinguished)
will contain either malformed UTF-8 sequences (when it is an
8-bit encoding) or ISO 2022 escape sequences (when it is a CJK
encoding), both of which make it quite unlikely that a
non-UTF-8 encoding is mistaken for UTF-8.
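That first test is easy to sketch; in Python (the helper name here is my own invention, not anything from "file"), a strict decode already performs the complete UTF-8 well-formedness check:

```python
def looks_like_utf8(data: bytes) -> bool:
    # A strict decode rejects every malformed sequence: overlong forms,
    # stray continuation bytes, truncated multi-byte characters, etc.
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Since an 8-bit text of any realistic length almost certainly contains some byte >= 0x80 that does not form a valid UTF-8 sequence, a single pass like this is already a strong discriminator.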
- The EUC encodings can similarly be ruled out by byte sequences
that are not allowed in them, such as unpaired GR bytes.
- ISO 8859 files should be free of C1 and most C0 codes (except
for the usual LF/TAB).
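The ISO 8859 plausibility test is equally simple to sketch (again a hypothetical helper, assuming TAB/LF/CR/FF are the C0 codes one wants to tolerate):

```python
def plausible_iso8859(data: bytes) -> bool:
    # Plain ISO 8859 text should contain no C1 controls (0x80-0x9F)
    # and, apart from TAB/LF/FF/CR, no C0 controls either.
    allowed_c0 = {0x09, 0x0A, 0x0C, 0x0D}
    return not any(
        (b < 0x20 and b not in allowed_c0) or 0x80 <= b <= 0x9F
        for b in data
    )
```

As a bonus, this test also flags the Windows code pages (CP1252 etc.), which put printable characters into the C1 range that ISO 8859 reserves for control codes.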
- Any file should be free of unused code positions.
- You can do a bit more with character and tuple frequency
analysis. You need for various languages (English, German,
French, C, Lisp) and their transliterations a library of
frequency tables for the various UCS characters/pairs,
and then you try all Something->UCS conversions
until you find the best match of the resulting histogram
with one in the library (read up on "index of coincidence"
[Friedman, ~1920] in introductory cryptanalysis textbooks
such as Stinson).
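The core statistic behind that comparison is easy to state; a minimal sketch of Friedman's index of coincidence (helper name assumed, not from any standard library):

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    # Probability that two characters drawn at random (without
    # replacement) from the text are identical.
    counts = Counter(text)
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

For the detector one would compute per-character (and per-pair) histograms of each candidate Something->UCS conversion and score them against the stored language tables; the conversion whose histogram matches best wins.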
- Add to that rules for which languages are likely to be encoded
in which way.
- Add to that a library of clue patterns from standardized marker
formats such as MIME headers, .htaccess files, Emacs headers,
the locale, etc.
- Set up a rule-based resolution algorithm that merges the results
of these tests according to rule priorities. For instance, the
presence of malformed UTF-8 sequences should carry more
weight than fragments of a MIME header that claim UTF-8.
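A very crude sketch of such a merge step (the function and the weights are purely illustrative; a real expert system would use prioritized rules rather than a flat sum):

```python
def merge_verdicts(evidence):
    # evidence: (encoding, weight) pairs emitted by the individual tests.
    # Hard evidence (e.g. malformed UTF-8 found) contributes a large
    # negative weight against a candidate; soft hints (a MIME header
    # claiming a charset) contribute a small positive one.
    scores = {}
    for encoding, weight in evidence:
        scores[encoding] = scores.get(encoding, 0.0) + weight
    return max(scores, key=scores.get)
```

Here a MIME header claiming UTF-8 (+2) is correctly overruled by the malformed-sequence test (-10), so the Latin-1 candidate wins.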
Make all that configurable for the end users, as they are
likely to have further a priori knowledge of which encodings
are to be expected.
These beasts used to be called "expert systems" when I went to
school and "A.I." was a research field, not a movie title ...
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/