On Mon, 22 Oct 2001, David Starner wrote:

> I'm apparently naive in expecting that opening up a UTF-8 file
> in a UTF-8 locale would get emacs to display it as a UTF-8 file.

Unfortunately, that doesn't work out of the box yet. Elisp currently has no direct way of accessing the output of nl_langinfo(CODESET), so Emacs doesn't know the current locale's character set and can't take it into account when deciding on the encoding of a loaded file. Gerd Moellmann <[EMAIL PROTECTED]> said that fixing this is already on the post-21 todo list.

Until then, maybe some nice elisp guru can come up with a few lines for your ~/.emacs file that call the shell command `locale charmap` and set the default encoding for every loaded file according to the output. The few cases I'm interested in under glibc 2.2 are just:

  $ LC_CTYPE=C locale charmap
  ANSI_X3.4-1968
  $ LC_CTYPE=en_GB locale charmap
  ISO-8859-1
  $ LC_CTYPE=en_GB.UTF-8 locale charmap
  UTF-8
  $ LC_CTYPE=pl_PL locale charmap
  ISO-8859-2
  $ LC_CTYPE=ru_RU locale charmap
  ISO-8859-5
  $ LC_CTYPE=el_GR locale charmap
  ISO-8859-7
  $ LC_CTYPE=de_DE locale charmap
  ISO-8859-15
  $ LC_CTYPE=ja_JP locale charmap
  EUC-JP
  $ LC_CTYPE=zh_CN locale charmap
  GB2312
  $ LC_CTYPE=ko_KR locale charmap
  EUC-KR

I made a little survey of the strings that nl_langinfo(CODESET) or `locale charmap` return on different operating systems, or that have been proposed in draft standards for locale names:

Good old 7-bit US-ASCII comes under the largest number of names:

  ASCII = US-ASCII = ANSI_X3.4-1968 = 646 = ISO646 = ISO_646.IRV

For ISO 8859-1 (and correspondingly the other parts) there are only two notations in use:

  ISO8859-1 = ISO-8859-1

And then there are also

  UTF-8
  TIS-620 = TIS620.2533 = ISO-8859-11 = ISO8859-11
  EUC-JP
  EUC-KR
  EUC-TW
  EUC-CN = GB2312
  VSCII
  GB18030
  GBK
  BIG5 = Big5
  KOI8-R
  KOI8-U
  WINDOWS-1251
  WINDOWS-1256

Also of interest for web page maintainers might be a ~/.emacs line that first tests whether the .htaccess file in the same directory as a to-be-loaded *.html (or correspondingly *.txt) file contains the corresponding line

  AddType text/html;charset=UTF-8 html
  AddType text/plain;charset=UTF-8 txt

and if so assumes that the loaded *.html or *.txt file is encoded accordingly, irrespective of the current locale setting (because Apache gets its character set from .htaccess and not from a user's locale).

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

Any elisp hackers who fancy having a look at that challenge?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
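
For anyone picking up the challenge, the two lookups described in this message (folding the many charmap spellings into one name, and reading a charset out of an .htaccess AddType line) can be sketched in plain POSIX shell. This is only a rough sketch under stated assumptions, not a finished solution: the function names canonical_charset and htaccess_charset are made up for illustration, the canonical spellings are an arbitrary choice, only a few of the aliases from the survey are covered, and the .htaccess parser handles only the simple unquoted AddType form quoted in this message.

```shell
#!/bin/sh
# Illustrative sketch; function names and canonical spellings are assumptions.

# Fold a charmap name, as printed by `locale charmap`, into one spelling.
canonical_charset() {
    # Upper-case first so that e.g. "Big5" and "BIG5" compare equal.
    name=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case $name in
        ASCII|US-ASCII|ANSI_X3.4-1968|646|ISO646|ISO_646.IRV)
            echo US-ASCII ;;
        ISO8859-*)
            # ISO8859-15 -> ISO-8859-15
            echo "ISO-8859-${name#ISO8859-}" ;;
        EUC-CN|GB2312)
            echo GB2312 ;;
        *)
            # Unknown or already canonical: pass through upper-cased.
            echo "$name" ;;
    esac
}

# Print the charset that an "AddType type;charset=X ext" line in the
# .htaccess next to file $1 declares for that file's extension.
# Prints nothing if there is no such line; returns 1 if no .htaccess exists.
htaccess_charset() {
    dir=$(dirname "$1")
    ext=${1##*.}
    [ -f "$dir/.htaccess" ] || return 1
    awk -v ext="$ext" '
        $1 == "AddType" && $2 ~ /;charset=/ {
            # Fields after the media type are the extensions it applies to.
            for (i = 3; i <= NF; i++)
                if ($i == ext || $i == ("." ext)) {
                    sub(/.*;charset=/, "", $2)
                    print $2
                    exit
                }
        }' "$dir/.htaccess"
}
```

With these loaded, canonical_charset "$(locale charmap)" gives the name to compare against, and htaccess_charset path/to/index.html gives the per-directory override, if any. An elisp version would still have to map the result onto an Emacs coding system name, which is not attempted here.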
