Re: Automatic encoding guessing

H. Peter Anvin Tue, 23 Oct 2001 10:35:59 -0700

Followup to:  <[EMAIL PROTECTED]>
By author:    Markus Kuhn <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> Depending on the amount of effort, you can distinguish different
> encodings quite well as long as the text is long enough for the usual
> cryptanalytic techniques for breaking substitution ciphers to work
> (which means usually >500 characters):
> 
>   - UTF-8 follows strict rules and every other encoding (except for the
>     UTF-8 subset ASCII, which usually need not be distinguished)
>     will contain either malformed UTF-8 sequences (when it's an
>     8-bit encoding) or ISO 2022 sequences (when it's a CJK
>     encoding), both of which make it pretty unlikely that a
>     non-UTF-8 encoding is mistaken for a UTF-8 encoding.
>


I have had data corruption because of the above assumption (some
versions of Tcl seem to make it) -- there are legal ISO-8859-x
sequences which are also legal UTF-8 sequences.
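
A minimal Python sketch of the overlap, as an illustration only: the two
bytes 0xC3 0xA9 are well-formed in both ISO-8859-1 and UTF-8, so a "does it
decode as UTF-8?" test cannot tell which one the writer meant. The helper
name looks_like_utf8 is hypothetical and is not taken from Tcl or any other
tool mentioned here.

```python
# Two bytes that are well-formed in both ISO-8859-1 and UTF-8.
data = bytes([0xC3, 0xA9])

# Read as ISO-8859-1: two characters, U+00C3 followed by U+00A9.
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-8: a single character, U+00E9.
as_utf8 = data.decode("utf-8")


def looks_like_utf8(raw: bytes) -> bool:
    """Hypothetical heuristic: True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The heuristic accepts the data, yet the ISO-8859-1 reading was just as legal.
print(looks_like_utf8(data), len(as_latin1), len(as_utf8))
```

Short inputs of this kind are where the guess is most likely to go wrong,
since there is little other text to contradict it.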

>   - EUC files similarly have characteristic byte sequences that are not
>     allowed in these encodings, such as unpaired GR bytes.
> 
>   - ISO 8859 files should be free of C1 and most C0 codes (except
>     for the usual LF/TAB).

I have also had Emacs 20 garble data because of the above assumption
:(
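
A small sketch of how the "no C1 codes" rule can misfire, assuming the
ISO-8859-1 mapping that Python's codec implements (bytes 0x80-0x9F map to
the C1 controls U+0080-U+009F). The function name has_c1_bytes is
hypothetical, and this is not a description of what Emacs 20 actually does.

```python
def has_c1_bytes(raw: bytes) -> bool:
    """Hypothetical check: True if any byte is in the C1 range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in raw)


# ISO-8859-1 data that contains a NEL control (0x85) between two words.
sample = "caf\u00e9\u0085na\u00efve".encode("iso-8859-1")

# A detector that rejects ISO 8859 whenever C1 bytes appear would refuse
# this data, even though it round-trips through the codec unchanged.
round_trip = sample.decode("iso-8859-1").encode("iso-8859-1")
print(has_c1_bytes(sample), round_trip == sample)
```

A tool that treats the rule as a hint rather than a verdict could still use
it to rank candidates without rewriting the data.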

Please, people; remember that heuristics are just that and can't be
blindly trusted :(

        -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt    <[EMAIL PROTECTED]>
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
