On Mon, 22 Oct 2001, David Starner wrote:
> I'm apparently naive in expecting that opening up a UTF-8 file
> in a UTF-8 locale would get emacs to display it as a UTF-8 file.

Unfortunately, that doesn't work out of the box yet.

Elisp currently has no direct way of accessing the output of
nl_langinfo(CODESET). Emacs therefore doesn't know the current
locale's character set and can't take it into account when deciding
on the character set of a loaded file. Gerd Moellmann <[EMAIL PROTECTED]> said
that fixing this is already on the post-21 todo list.

Until then, maybe some kind elisp guru can come up with a few lines for
your ~/.emacs file that call the shell command `locale charmap` and
set the default encoding for every loaded file according to the output.
The few cases I'm interested in under glibc 2.2 are just:

$ LC_CTYPE=C locale charmap
ANSI_X3.4-1968
$ LC_CTYPE=en_GB locale charmap
ISO-8859-1
$ LC_CTYPE=en_GB.UTF-8 locale charmap
UTF-8
$ LC_CTYPE=pl_PL locale charmap
ISO-8859-2
$ LC_CTYPE=ru_RU locale charmap
ISO-8859-5
$ LC_CTYPE=el_GR locale charmap
ISO-8859-7
$ LC_CTYPE=de_DE locale charmap
ISO-8859-15
$ LC_CTYPE=ja_JP locale charmap
EUC-JP
$ LC_CTYPE=zh_CN locale charmap
GB2312
$ LC_CTYPE=ko_KR locale charmap
EUC-KR
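
For comparison, the same query can be sketched outside of Emacs. The
following is only an illustration in Python, not elisp; the function name
current_codeset is made up for this example, and locale.nl_langinfo is
assumed to be available (it is on POSIX systems, not on Windows). On glibc
the result should match what `locale charmap` prints.

```python
import locale


def current_codeset() -> str:
    """Return the codeset name of the LC_CTYPE locale chosen by the environment."""
    try:
        # "" selects the locale from LC_ALL / LC_CTYPE / LANG.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # stay in the locale that was already active.
        pass
    # Python's wrapper around nl_langinfo(CODESET).
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(current_codeset())
```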

I made a little survey of the strings that nl_langinfo(CODESET) or
`locale charmap` return on different operating systems, or that
have been proposed in draft standards for locale names:

Good old 7-bit US-ASCII goes by the largest number of names:

  ASCII = US-ASCII = ANSI_X3.4-1968 = 646 = ISO646 = ISO_646.IRV

For ISO 8859-1 (and correspondingly the other parts)
there are only two notations in use:

  ISO8859-1 = ISO-8859-1

And then there are also

  UTF-8
  TIS-620 = TIS620.2533 = ISO-8859-11 = ISO8859-11
  EUC-JP
  EUC-KR
  EUC-TW
  EUC-CN = GB2312
  VSCII
  GB18030
  GBK
  BIG5 = Big5
  KOI8-R
  KOI8-U
  WINDOWS-1251
  WINDOWS-1256
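
Whatever ends up reading these strings has to fold the different spellings
together before it can pick an encoding. A minimal sketch of that step in
Python (not elisp) follows. The alias table covers only some of the names
listed above, and the choice of which spelling counts as canonical is an
assumption made for this example.

```python
import re

# Some of the spellings from the survey, keyed by upper-cased name.
# The "canonical" values on the right are an arbitrary choice.
ALIASES = {
    "ASCII": "US-ASCII",
    "ANSI_X3.4-1968": "US-ASCII",
    "646": "US-ASCII",
    "ISO646": "US-ASCII",
    "ISO_646.IRV": "US-ASCII",
    "GB2312": "EUC-CN",
    "TIS620.2533": "TIS-620",
}


def canonical_codeset(name: str) -> str:
    """Fold a codeset name as reported by the system to one spelling."""
    key = name.strip().upper()
    # ISO8859-N and ISO-8859-N are the same thing for every part N.
    m = re.fullmatch(r"ISO-?8859-(\d+)", key)
    if m:
        return "ISO-8859-" + m.group(1)
    # Unknown names are passed through upper-cased (Big5 becomes BIG5).
    return ALIASES.get(key, key)
```

With this, both "ISO8859-15" and "ISO-8859-15" come out as "ISO-8859-15",
and "ANSI_X3.4-1968" comes out as "US-ASCII".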


Also of interest for web page maintainers might be a ~/.emacs line that
first tests whether the .htaccess file in the directory of a to-be-loaded
*.html (or correspondingly *.txt) file contains the corresponding
line

  AddType text/html;charset=UTF-8 html
  AddType text/plain;charset=UTF-8 txt

and if so then assumes that the loaded *.html or *.txt file is encoded
accordingly, irrespective of the current locale setting (because Apache
gets its character set from .htaccess and not from a user's locale).
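
The lookup itself is simple enough to sketch. The following is a rough
illustration in Python rather than elisp. The names htaccess_charsets and
charset_for are invented for this example. It only understands the unquoted
AddType form shown above and only looks at the .htaccess in the file's own
directory, not at parent directories or the main server configuration.

```python
import os
import re

# Matches lines such as "AddType text/html;charset=UTF-8 html".
_ADDTYPE = re.compile(r"^\s*AddType\s+\S+?;charset=(\S+)\s+(.+)$", re.IGNORECASE)


def htaccess_charsets(text):
    """Map file extension -> charset for AddType lines that name a charset."""
    table = {}
    for line in text.splitlines():
        m = _ADDTYPE.match(line)
        if m:
            charset = m.group(1)
            # One AddType line may list several extensions.
            for ext in m.group(2).split():
                table[ext.lstrip(".").lower()] = charset
    return table


def charset_for(path):
    """Return the charset declared for path's extension, or None."""
    directory = os.path.dirname(os.path.abspath(path))
    try:
        with open(os.path.join(directory, ".htaccess"), encoding="latin-1") as f:
            table = htaccess_charsets(f.read())
    except OSError:
        # No readable .htaccess next to the file.
        return None
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return table.get(ext)
```

A caller would fall back to the locale's codeset whenever charset_for
returns None.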

http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

Any elisp hackers who fancy having a look at that challenge?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
