Jim Breen wrote:
> About 8 years ago when UTF-8 first emerged (it was called UTF-FSS then,
> ISTR) I noticed that the usual Japanese code detection utilities
> invariably thought text containing UTF-8 was in Shift-JIS. So I wrote my
> own utility which reliably differentiated between UTF-8, Shift-JIS,
> EUC-JP and ISO-2022-JP. It didn't do anything fancy, except that it
> reversed the usual policy of looking for codes in a particular set.
> Instead it started with all possible sets as candidates and eliminated
> each one as soon as it found a code that didn't fit. It stopped once
> there was only one code left. This approach worked fine for the codes
> and codings mentioned above, as each code has a range where it alone is
> legal. I don't know how it would go if ISO-8859-* were added to the mix.
> I must dust it off and see.
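A minimal sketch of the elimination approach described above, assuming Python's
strict codecs stand in for hand-written validity checks (the chunk size and the
"detect" helper are illustrative choices, not the original utility):

import codecs

# Candidate encodings; each one is dropped as soon as the input contains
# a byte sequence that is illegal in it.
CANDIDATES = ["utf-8", "shift_jis", "euc-jp", "iso-2022-jp"]

def detect(data):
    decoders = {enc: codecs.getincrementaldecoder(enc)("strict")
                for enc in CANDIDATES}
    live = set(CANDIDATES)
    for pos in range(0, len(data), 256):    # feed the input in small chunks
        chunk = data[pos:pos + 256]
        for enc in list(live):
            try:
                decoders[enc].decode(chunk)
            except UnicodeDecodeError:
                live.discard(enc)           # eliminated: illegal in this encoding
        if len(live) == 1:                  # stop once only one code is left
            return live.pop()
    return None                             # still ambiguous (e.g. pure ASCII)

As in the utility described above, every candidate starts live and can only be
removed, so the scan stops as soon as a single encoding remains. A pure-ASCII
input never disambiguates, which is exactly the problem raised in the reply
below.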
I like your solution. However, even that won't solve one problem with the
character sets and the way Emacs handles them.

7-bit ASCII is a valid subset of an enormous number of character sets. If a
file contains only ASCII characters but is meant to be encoded in a more
complete character set, it takes input from the user to decide which one to
use.

Now we come to the Emacs issue. The characters in UTF-8 and ISO 8859-x have
not been unified in Emacs' internal representation, so inserting the "same
character" under UTF-8 or under ISO 8859-3 means inserting a different
character into the buffer. Choosing an encoding that isn't compatible with the
characters your input method inserts can be a pain. I've done this more than
once with precisely the two character sets I named, because I am maintaining
files in both encodings.

My point is that there isn't a universal solution so long as we want support
for the character sets we are discussing.
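To make the ASCII point above concrete, here is a small illustrative check
(the encoding list is only an example): a byte string containing nothing but
ASCII decodes, to the same text, under essentially any of the encodings being
discussed, so no elimination scheme can narrow the choice further without
input from the user.

ascii_bytes = b"plain ASCII text\n"

# These encodings all treat the bytes 0x00-0x7F as plain ASCII (at least
# as Python's codecs implement them), so a pure-ASCII file is valid, and
# reads the same, under every one of them.
for enc in ["utf-8", "iso-8859-1", "iso-8859-3",
            "shift_jis", "euc-jp", "iso-2022-jp"]:
    assert ascii_bytes.decode(enc) == "plain ASCII text\n"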
--
D. Dale Gulledge, Sr. Programmer, [EMAIL PROTECTED]
C, C++, Perl, Unix (AIX, Linux), Oracle, Java, Internationalization (i18n), Awk.
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/