Re: Automatic encoding guessing

Jim Breen Wed, 24 Oct 2001 16:38:44 -0700

[Markus Kuhn (Re: Automatic encoding guessing) writes:]
>> On Tue, 23 Oct 2001, D. Dale Gulledge wrote:
>> > Is there a reliable tool for determining the encoding of a file?
>> No, but here are a few ideas for whoever wants to make a good one
>> (probably to be contributed for the GNU "file" utility):
>> 
>> Depending on the amount of effort, you can distinguish different
>> encodings quite well as long as the text is long enough for the usual
>> cryptanalytic techniques for breaking substitution ciphers to work
>> (which means usually >500 characters):


[snip]

About 8 years ago, when UTF-8 first emerged (it was called UTF-FSS then,
ISTR), I noticed that the usual Japanese code detection utilities
invariably thought text containing UTF-8 was in Shift-JIS. So I
wrote my own utility which reliably differentiated between UTF-8,
Shift-JIS, EUC-JP and ISO-2022-JP. It didn't do anything fancy, except that
it reversed the usual policy of looking for codes in a particular set.
Instead, it started with all possible sets as candidates and eliminated
each one as soon as it found a code that didn't fit. It stopped once there
was only one candidate left. This approach worked fine for the codings
mentioned above, as each has a range where it alone is legal. I don't
know how it would go if ISO-8859-* were added to the mix. I must dust it
off and see.
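
A minimal sketch of the elimination approach described above, in Python.
This is not the original utility. The function name guess_encoding is
hypothetical, and instead of hand-written tables of legal byte ranges it
leans on Python's strict incremental decoders to decide when a byte
sequence is illegal for a candidate. It also assumes that an ESC byte
never occurs in UTF-8, Shift-JIS or EUC-JP text, so that ISO-2022-JP can
be told apart from the 8-bit codings.

```python
import codecs

CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "iso2022_jp")
ESC = 0x1B


def guess_encoding(data: bytes) -> set:
    """Return the candidate encodings that survive elimination."""
    # Start with every candidate alive, each with its own strict decoder.
    alive = {name: codecs.getincrementaldecoder(name)("strict")
             for name in CANDIDATES}
    last = len(data) - 1
    for i, value in enumerate(data):
        for name in list(alive):
            try:
                # Feed one byte; the second argument marks the final byte.
                alive[name].decode(bytes([value]), i == last)
            except UnicodeDecodeError:
                del alive[name]  # the bytes seen so far are illegal here
                continue
            # Assumption: ESC only occurs in ISO-2022-JP escape sequences.
            if value == ESC and name != "iso2022_jp":
                del alive[name]
        if len(alive) <= 1:
            break  # stop once at most one candidate is left
    return set(alive)


if __name__ == "__main__":
    for codec in ("shift_jis", "euc_jp", "iso2022_jp"):
        print(codec, guess_encoding("日本語".encode(codec)))
```

Short or ASCII-only input can leave several candidates alive; a caller
would have to treat that result as ambiguous rather than pick one.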

Cheers

Jim

-- 
Jim Breen  [[EMAIL PROTECTED]  http://www.csse.monash.edu.au/~jwb/]
Computer Science & Software Engineering,                Tel: +61 3 9905 3298
P.O Box 26, Monash University,                          Fax: +61 3 9905 5146
Clayton VIC 3800, Australia      ジム・ブリーン@モナシュ大学
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/