[Markus Kuhn (Re: Automatic encoding guessing) writes:]
>> On Tue, 23 Oct 2001, D. Dale Gulledge wrote:
>> > Is there a reliable tool for determining the encoding of a file?
>> No, but here are a few ideas for whoever wants to make a good one
>> (probably to be contributed for the GNU "file" utility):
>>
>> Depending on the amount of effort, you can distinguish different
>> encodings quite well as long as the text is long enough for the usual
>> cryptanalytic techniques for breaking substitution ciphers to work
>> (which means usually >500 characters):
[snip]

About 8 years ago, when UTF-8 first emerged (it was called UTF-FSS then,
ISTR), I noticed that the usual Japanese code detection utilities
invariably thought text containing UTF-8 was in Shift-JIS. So I wrote my
own utility, which reliably differentiated between UTF-8, Shift-JIS,
EUC-JP and ISO-2022-JP.

It didn't do anything fancy, except that it reversed the usual policy of
looking for codes in a particular set. Instead, it started with all
possible sets as candidates and eliminated each candidate as soon as it
found a code that didn't fit it. It stopped once there was only one
candidate left.

This approach worked fine for the codes and encodings mentioned above, as
each has a range where it alone is legal. I don't know how it would go if
ISO-8859-* were added to the mix. I must dust it off and see.

Cheers

Jim

--
Jim Breen  [[EMAIL PROTECTED]  http://www.csse.monash.edu.au/~jwb/]
Computer Science & Software Engineering,  Tel: +61 3 9905 3298
P.O. Box 26, Monash University,           Fax: +61 3 9905 5146
Clayton VIC 3800, Australia    ジム・ブリーン@モナシュ大学
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
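
The elimination strategy described in the message above might be sketched
roughly as follows. This is not the original utility, whose source is not
shown here; it is a minimal Python illustration under stated assumptions.
All names (guess_encoding, the start_* helpers, STARTERS) are hypothetical,
the byte ranges are simplified (no check for overlong UTF-8 or unassigned
code points), and treating ESC as illegal outside ISO-2022-JP is an
assumption made so that ISO-2022-JP has a range of its own.

```python
ESC = 0x1B
CONT = ((0x80, 0xBF),)                      # UTF-8 continuation byte
SJIS_TRAIL = ((0x40, 0x7E), (0x80, 0xFC))   # Shift-JIS second byte
HIGH = ((0xA1, 0xFE),)                      # EUC-JP high byte
KANA = ((0xA1, 0xDF),)                      # EUC-JP byte after SS2 (0x8E)

# Each start_* function inspects the first byte of a character and returns
# a tuple of range sets, one per trail byte still owed, or None if the byte
# is illegal there. ESC is rejected in the 8-bit encodings (an assumption).

def start_utf8(b):
    if b < 0x80:
        return () if b != ESC else None
    if 0xC2 <= b <= 0xDF:
        return (CONT,)
    if 0xE0 <= b <= 0xEF:
        return (CONT, CONT)
    if 0xF0 <= b <= 0xF4:
        return (CONT, CONT, CONT)
    return None

def start_sjis(b):
    if b < 0x80:
        return () if b != ESC else None
    if 0xA1 <= b <= 0xDF:                   # single-byte half-width katakana
        return ()
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
        return (SJIS_TRAIL,)
    return None

def start_eucjp(b):
    if b < 0x80:
        return () if b != ESC else None
    if 0xA1 <= b <= 0xFE:
        return (HIGH,)
    if b == 0x8E:
        return (KANA,)
    if b == 0x8F:
        return (HIGH, HIGH)
    return None

def start_iso2022jp(b):
    return () if b < 0x80 else None         # strictly 7-bit; ESC is allowed

STARTERS = {
    "UTF-8": start_utf8,
    "Shift_JIS": start_sjis,
    "EUC-JP": start_eucjp,
    "ISO-2022-JP": start_iso2022jp,
}

def guess_encoding(data):
    """Return the set of candidate names that survive elimination."""
    owed = {name: () for name in STARTERS}  # start with every candidate
    for b in data:
        for name in list(owed):
            if owed[name]:                  # inside a multi-byte character
                ok = any(lo <= b <= hi for lo, hi in owed[name][0])
                nxt = owed[name][1:] if ok else None
            else:                           # at a character boundary
                nxt = STARTERS[name](b)
            if nxt is None:
                del owed[name]              # this byte does not fit: eliminate
            else:
                owed[name] = nxt
        if len(owed) <= 1:                  # stop once one candidate is left
            break
    else:
        # Reached the end of input: drop candidates left mid-character.
        owed = {n: o for n, o in owed.items() if not o}
    return set(owed)
```

The function returns every candidate that was never contradicted, so short
or pure-ASCII input can come back with more than one name. For example,
guess_encoding(b"\x93\xfa\x96\x7b\x8c\xea") narrows to Shift_JIS on the
first byte, while the six UTF-8 bytes of two hiragana,
b"\xe3\x81\x82\xe3\x81\x84", also form legal Shift-JIS byte pairs and so
leave both UTF-8 and Shift_JIS standing.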
