John Oliver wrote:
> On Mon, Mar 19, 2007 at 01:36:48PM -0700, Andrew Lentvorski wrote:
>> John Oliver wrote:
>>> I just had someone ask me why none of her files in Chinese, Japanese,
>>> etc. no longer show up in the correct character sets. Everything is
>>> UTF-8, and up until this morning, everything "worked right". She could,
>>> for example, more a file with Chinese characters and see it correctly,
>>> then more a file with Japanese characters and see it correctly. I would
>>> have assumed that something would have to change the LANG variable for
>>> that to work, but don't see how.
>> LANG could be the culprit. LOCALE could also be the culprit.
>
> AFAIK, LANG never changes. It's always en_US.UTF-8 I don't see how
> more-ing a given file could somehow "know" the "correct" character set
> and change LANG for it, and then change it back.
>
> As for LOCALE... I noticed that when I ssh into her machine and su to
> her account and run locale, I get
>
> [EMAIL PROTECTED] ~]$ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>
>
> In a terminal on that box (under X) everything is "C"
>
> Where are those values set?
>
>> However, are you *sure* her files are UTF-8?
>
> I guess! Everything "worked right" last week.
>
>> Take a look at the files in Firefox where you can set the character set
>> via a menu to make sure.
>>
>> Quite often Chinese and Japanese files are not in UTF-8 (especially if
>> the originate from China or Japan).
>
> Everything here is, since it's all for translation work... matching
> sentences in at least two different languages.
>
I believe it helps simplify the matter to be just a bit pedantic and
point out that _locale_ is a larger concept, and that it's just the
character-type part of locale (LC_CTYPE) you are concerned with. There
are, however, some "container-like" variables, LC_ALL and LANG, which
(roughly speaking) stand for "everything". 'man 7 locale' seems to be
somewhat readable, although you may wish to peruse 'man -a locale'.

I'm not quite sure about the details, but the LANG environment variable
seems to be set (in Fedora) in the functions utility, which is sourced
near the top of rc.sysinit. The /etc/sysconfig/i18n file specifies the
desired LANG value. Evidently LANG supplies the default for any
otherwise unset locale variables (the locale command shows such implied
values in double quotes). Here are a couple of experiments:

    ( unset LANG; sh -c "locale" )
    ( LANG=fr LC_CTYPE=de sh -c "locale" )

To get the correct display of language-specific characters (aka:
glyphs), the data must be properly encoded (in one of possibly several
encodings), the program has to be told (or know) how to interpret the
data, and the display mechanism has to "handle" the output characters.
Somewhere in there is a font operation that translates each specified
character to a bitmap. The output & display may use a different
encoding than the data & input, by the way.

There's lots more to say, but I think I'm drifting, so I'll stop
expounding.

Andy's question "..are you *sure* her files are .." seems very
pertinent. You may want to do a hexdump of a few bytes of the file to
try to confirm that. You can use od or hexdump or xxd to display the
raw data. I would also second his suggestion to view the file in
Firefox, and play with the view>encoding choices.

If you don't know what changed from last week, you are stuck with
trying to figure out from scratch what needs to be done this week, eh?
;-)

Regards,
..jim

-- 
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
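
As a rough sketch of the hexdump suggestion above: the script below
checks whether a file decodes cleanly as UTF-8 and shows its first few
bytes. The helper names (is_utf8, first_bytes) and the two sample files
are made up for illustration; they are not the poster's data. It assumes
iconv and od are installed, and is best saved and run as a script.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: print the first 16 bytes of the file in hex.
first_bytes() {
    od -A n -t x1 -N 16 "$1"
}

# Sample data created only for this illustration.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# The two characters U+4E2D U+6587 ("Chinese text"), encoded two ways.
printf '\344\270\255\346\226\207\n' > "$tmpdir/utf8.txt"    # UTF-8 bytes
printf '\326\320\316\304\n'         > "$tmpdir/gb2312.txt"  # GB2312 bytes

for f in "$tmpdir/utf8.txt" "$tmpdir/gb2312.txt"; do
    if is_utf8 "$f"; then
        verdict="valid UTF-8"
    else
        verdict="NOT valid UTF-8"
    fi
    echo "$f: $verdict"
    first_bytes "$f"
done
```

A file that passes this check can still be mislabeled text, so this only
rules UTF-8 out; it does not prove which encoding a failing file uses.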

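
To make the LANG / LC_* / LC_ALL relationship concrete, here is a small
sketch of the precedence order as POSIX describes it: LC_ALL wins if
non-empty, then the specific LC_* variable, then LANG. The function name
pick_locale is hypothetical, and the final fallback of "C" is an
assumption (the default is implementation-defined, though C/POSIX is
common). Values are passed as arguments so the real environment is never
modified.

```shell
#!/bin/sh
# pick_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Prints the first non-empty value, following the usual precedence:
# LC_ALL, then the specific LC_* category, then LANG.
pick_locale() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$3" ]; then
        echo "$3"
    else
        echo "C"    # assumed fallback; implementation-defined in general
    fi
}

pick_locale ""            ""   ""     # nothing set at all
pick_locale ""            ""   "fr"   # only LANG set
pick_locale ""            "de" "fr"   # LC_CTYPE overrides LANG
pick_locale "ja_JP.UTF-8" "de" "fr"   # LC_ALL overrides both

# The same rule applied to LC_CTYPE in the current session.
pick_locale "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

Comparing the last line's output between an ssh session and an X
terminal on the same box is one quick way to see whether the two start
with different settings.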