Followup to: <[EMAIL PROTECTED]>
By author: Markus Kuhn <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> Depending on the amount of effort, you can distinguish different
> encodings quite well as long as the text is long enough for the usual
> cryptanalytic techniques for breaking substitution ciphers to work
> (which means usually >500 characters):
>
> - UTF-8 follows strict rules and every other encoding (except for the
> UTF-8 subset ASCII, which usually need not be distinguished)
> will contain either malformed UTF-8 sequences (when it's an
> 8-bit encoding) or ISO 2022 sequences (when it's a CJK
> encoding), both of which make it pretty unlikely that a
> non-UTF-8 encoding is mistaken for a UTF-8 encoding.
>
I have had data corruption because of the above assumption (some
versions of Tcl seem to make it) -- there are legal ISO-8859-x
sequences which are also legal UTF-8 sequences.
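
To make the ambiguity concrete, here is a minimal Python sketch. The
byte string and the helper name is_valid_utf8 are illustrative choices
of mine, not anything taken from Tcl; the point is only that the same
bytes decode without error under both encodings.

```python
# Two bytes that are legal in both encodings but mean different things.
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")          # one character: U+00E9
as_latin1 = data.decode("iso-8859-1")   # two characters: U+00C3 U+00A9


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Well-formed UTF-8 is evidence, not proof: this data passes the check
# even if its author meant the two ISO-8859-1 characters.
print(is_valid_utf8(data), len(as_utf8), len(as_latin1))
```

A detector that treats "decodes as UTF-8" as conclusive will silently
rewrite such ISO-8859-1 input, which is the kind of corruption
described above.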
> - EUC files similarly have characteristic byte sequences that are not
> allowed in these encodings, such as unpaired GR bytes.
>
> - ISO 8859 files should be free of C1 and most C0 codes (except
> for the usual LF/TAB).
I have also had Emacs 20 garble data because of the above assumption
:(
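
A sketch of how this check can go wrong in both directions, assuming it
is implemented roughly as below. The function name looks_like_iso8859
is hypothetical, allowing CR alongside LF/TAB is my assumption, and
this is not the code Emacs actually uses.

```python
def looks_like_iso8859(raw: bytes) -> bool:
    """Heuristic from the quoted text: no C1 bytes, few C0 bytes."""
    allowed_c0 = {0x09, 0x0A, 0x0D}  # TAB, LF, and (assumed) CR
    for b in raw:
        if 0x80 <= b <= 0x9F:        # C1 control range
            return False
        if b < 0x20 and b not in allowed_c0:
            return False
    return True


# False positive: UTF-8 for U+00E9 is C3 A9, which contains no C1 byte,
# so UTF-8 text can pass as ISO 8859.
utf8_text = "\u00e9".encode("utf-8")

# False negative: 0x85 (NEL) is a C1 code, yet it is a legal byte in
# ISO-8859-1 data and decodes without error.
latin1_with_nel = b"line one\x85line two"

print(looks_like_iso8859(utf8_text), looks_like_iso8859(latin1_with_nel))
```

Neither result says what the data really is; the check only narrows
the candidates.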
Please, people; remember that heuristics are just that and can't be
blindly trusted :(
-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[EMAIL PROTECTED]>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/