Jim Breen wrote:
> About 8 years ago when UTF-8 first emerged (it was called UTF-FSS then,
> ISTR) I noticed that the usual Japanese code detection utilities
> invariably thought text containing UTF-8 was in Shift-JIS. So I wrote my
> own utility which reliably differentiated between UTF-8, Shift-JIS,
> EUC-JP and ISO-2022-JP. It didn't do anything fancy, except that it
> reversed the usual policy of looking for codes in a particular set.
> Instead it started with all possible sets as candidates and eliminated
> each one as soon as it found a code that didn't fit. It stopped once
> there was only one code left. This approach worked fine for the codes
> and codings mentioned above, as each code has a range where it alone is
> legal. I don't know how it would go if ISO-8859-* were added to the mix.
> I must dust it off and see.
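A minimal sketch of the elimination approach described above, assuming Python's
strict codecs stand in for hand-written validity checks (the chunk size and the
"detect" helper are illustrative choices, not the original utility):

import codecs

# Candidate encodings; each one is dropped as soon as the input contains
# a byte sequence that is illegal in it.
CANDIDATES = ["utf-8", "shift_jis", "euc-jp", "iso-2022-jp"]

def detect(data):
    decoders = {enc: codecs.getincrementaldecoder(enc)("strict")
                for enc in CANDIDATES}
    live = set(CANDIDATES)
    for pos in range(0, len(data), 256):    # feed the input in small chunks
        chunk = data[pos:pos + 256]
        for enc in list(live):
            try:
                decoders[enc].decode(chunk)
            except UnicodeDecodeError:
                live.discard(enc)           # eliminated: illegal in this encoding
        if len(live) == 1:                  # stop once only one code is left
            return live.pop()
    return None                             # still ambiguous (e.g. pure ASCII)

As in the utility described above, every candidate starts live and can only be
removed, so the scan stops as soon as a single encoding remains. A pure-ASCII
input never disambiguates, which is exactly the problem raised in the reply
below.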
I like your solution. However, even that won't solve one problem with the
character sets and the way Emacs handles them.

7-bit ASCII is a valid subset of an enormous number of character sets. If a
file contains only ASCII characters but is meant to be encoded in a more
complete character set, it takes input from the user to decide which one to
use.

Now we come to the Emacs issue. The characters in UTF-8 and ISO 8859-x have
not been unified in Emacs' internal representation, so inserting the "same
character" under UTF-8 or under ISO 8859-3 means inserting a different
character into the buffer. Choosing an encoding that isn't compatible with the
characters your input method inserts can be a pain. I've done this more than
once with precisely the two character sets I named, because I am maintaining
files in both encodings.

My point is that there isn't a universal solution so long as we want support
for the character sets we are discussing.
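To make the ASCII point above concrete, here is a small illustrative check
(the encoding list is only an example): a byte string containing nothing but
ASCII decodes, to the same text, under essentially any of the encodings being
discussed, so no elimination scheme can narrow the choice further without
input from the user.

ascii_bytes = b"plain ASCII text\n"

# These encodings all treat the bytes 0x00-0x7F as plain ASCII (at least
# as Python's codecs implement them), so a pure-ASCII file is valid, and
# reads the same, under every one of them.
for enc in ["utf-8", "iso-8859-1", "iso-8859-3",
            "shift_jis", "euc-jp", "iso-2022-jp"]:
    assert ascii_bytes.decode(enc) == "plain ASCII text\n"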
--
D. Dale Gulledge, Sr. Programmer, [EMAIL PROTECTED]
C, C++, Perl, Unix (AIX, Linux), Oracle, Java, Internationalization (i18n), Awk.
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/