Re: [gentoo-user] Glibc, userlocales, and ENV Variables

Hans-Werner Hilse Wed, 02 Nov 2005 14:06:04 -0800

Hi,

On Wed, 02 Nov 2005 21:16:49 +0100
Holly Bostick <[EMAIL PROTECTED]> wrote:


> OK, I kinda get that.... and dmesg says during boot that the terminal
> (agetty) is being configured to use UTF-8 (which is what I told it to do
> when I built the kernel, so that's OK).

The kernel console is configured by the Gentoo rc system using
"unicode_start". This sets the console charmap and font.

> So does that mean that when I log in to my DE/WM, and start X, the
> charmap will be automatically UTF-8, because that's what the getty was?

No, that's independent. Your X terminal program talks to X and uses its
font subsystem. That also uses charset information for finding correct
fonts. On the other hand there are other means to use fonts now and
some solutions do an intermediate mapping to unicode.

> It's not clear to me whether the Euro symbol is included in UTF-8
> encodings, or only as a special variant of ISO-8859-15 (the "@euro"
> variant), which is one of the reasons I try to encode both.

It's in both UTF-8 (which includes almost every sign under the sun)
and ISO-8859-15, which is the same as "latin9" (hint: look for this when
searching for console fonts!) and is a slightly modified ISO-8859-1
a.k.a. latin1.
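
To make that concrete, here is a small sketch in Python (my choice for
illustration only, nothing in this thread depends on it). It encodes the
Euro sign in the three charsets mentioned above. The codec names are
Python's spellings, not necessarily the charmap names glibc uses.

```python
# Encode the Euro sign (U+20AC) in the charsets discussed above.
EURO = "\u20ac"

utf8_bytes = EURO.encode("utf-8")          # multibyte sequence in UTF-8
latin9_bytes = EURO.encode("iso-8859-15")  # single byte in ISO-8859-15 (latin9)

# ISO-8859-1 (latin1) has no Euro sign, so encoding must fail there.
try:
    EURO.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

print(utf8_bytes, latin9_bytes, latin1_has_euro)
```

So a plain ISO-8859-1 locale cannot represent the Euro sign at all, while
both ISO-8859-15 and UTF-8 can.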

> > [EMAIL PROTECTED]/ISO-8859-1 didn't make much sense to me (and maybe causes
> >  some failures when building?), but other from that it seemed OK.
> 
> Well, of course I know less about this than you do [...

Oh, I have known what I am writing here for only a few minutes, no
longer. In fact, the whole locale setup is terribly badly documented.

> ...] , but my native Dutch boyfriend runs a English Windows machine, I 
> run Windows programs with Wine, and about the only thing I think I know 
> about the whole issue is that Windows pretty much only knows ISO-8859-1 
> (unless you had a multi-lingual version, which neither of us did). So I 
> wanted support for ISO-8859-1 to be available (with support for the Euro
> symbol for those MS fonts that support it, which I think that the core 
> MS fonts now do by default, though I'm not sure about that either).

Windows has used Unicode since 2000 (or even NT?). However, that doesn't
mean it ships fonts covering the full Unicode charset.

> In any case, if such an application called for ISO-8859-1 , I wanted it
> to be there, though as you can tell, I don't get how this is all
> supposed to work well enough to be sure that was the way to accomplish
> the goal.

This whole subject is very interesting. Actually, I'm just reading my way
through the glibc sources, as I've always been interested in this. And it
is _very_ badly documented. I've mentioned this....

In fact, there's no difference between the nl_NL and the [EMAIL PROTECTED]
locale. I think probably all of the @euro locales are more or less
obsolete now. I think they're a remnant from the time when the new
currency was introduced and the user had to choose. Now, the @euro
locales de facto just import their [EMAIL PROTECTED] counterparts. This is
written in the relevant changelog:

* locales/br_FR: Eliminate old national currencies of countries
  participating in Euro.  Make @euro files pure copies.
(continues for all @euro)

A .UTF-8 locale doesn't exist in glibc's locale database, so the suffix
must get stripped when the locale is generated.

To give you a hint, the default locales generated for language tag nl,
subtag NL are

nl_NL
[EMAIL PROTECTED]
nl_NL.UTF8

In your locales.build, this should read

nl_NL/ISO-8859-1
[EMAIL PROTECTED]/ISO-8859-15
nl_NL.UTF-8/UTF-8

to have the proper locale for each encoding you/your terminal may come
across. Although @euro is a copy, it is needed to identify and
distinguish each of the generated locales. After all, nl_NL... is just
a name; it could have been another name, too. But the "LANG" setting is
also used by gettext (which has nothing to do with all this, but has
the same author), AFAIK, and thus shouldn't be chosen totally arbitrarily.
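
The "name/charmap" layout of those locales.build lines can be sketched in
Python as below. This is only an illustration: parse_locales_build is a
hypothetical helper, and treating "#" as a comment marker is my assumption
about the file format, not something I have checked.

```python
def parse_locales_build(text):
    """Split 'name/charmap' lines into (locale name, charmap) pairs.

    Hypothetical helper; '#' comments are an assumed convention.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        name, sep, charmap = line.partition("/")
        if not sep or not name or not charmap:
            raise ValueError("expected 'name/charmap', got: %r" % raw)
        entries.append((name, charmap))
    return entries


sample = "nl_NL/ISO-8859-1\nnl_NL.UTF-8/UTF-8\n"
entries = parse_locales_build(sample)
print(entries)
```

The point is just that the part before the slash is the name you later put
in LANG, and the part after it is the charmap the locale is built against.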

Your LANG setting should be "[EMAIL PROTECTED]" for non-unicode environments
(given that you're using latin9/ISO-8859-15 fonts) and "nl_NL.UTF-8" in
unicode environments.
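
As a rough sketch of how the environment decides which of these names is
in effect for character handling: LC_ALL overrides LC_CTYPE, which
overrides LANG. The two Python helpers below are hypothetical names of
mine, and falling back to "C" when nothing is set is an assumption that
matches what I would expect from glibc.

```python
def effective_ctype_locale(env):
    """Return the locale name that governs character handling.

    Precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values count
    as unset. The "C" fallback is an assumption.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def codeset_of(locale_name):
    """Return the part between '.' and '@' of a locale name, or None."""
    base = locale_name.split("@", 1)[0]
    _, sep, codeset = base.partition(".")
    return codeset if sep else None


name = effective_ctype_locale({"LANG": "nl_NL.UTF-8"})
print(name, codeset_of(name))
```

On a real system you would pass os.environ instead of a hand-made dict.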

> > How does the "borkism" of your locales manifest?
> 
> Most of the time, when Dutch characters are meant to be used, they are,
> as in the following example:
> 
> [EMAIL PROTECTED] -> killall -9 conky
> conky: geen proces beëindigd
> 
> but sometimes I get this:
> 
>  killall -9 MPlayer
> MPlayer: geen proces beëindigd
> 
> ..... now *that's* interesting... I copied and pasted the second from a
> terminal (mrvxt, whereas the first was from multi-gnome-terminal), where
> what appeared was
> 
>  killall -9 MPlayer
> MPlayer: geen proces beA(with the ~ over it) << (but tiny ones)indigd
> 
> in place of the ë . But when I pasted it into this compose window, it
> came out right! But it isn't in the term.

This is probably due to different clipboard implementations. GTK has
its own clipboard which imports things from the X clipboard facility
that mrxvt is using. Probably GTK applies some logic like recognizing
multibyte sequences and decides that it is most probably UTF-8 before
putting that in its clipboard.
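
The kind of guessing I mean could look like the Python sketch below. To be
clear, this is not GTK's actual code (I have not read it); guess_decode is
a hypothetical function, and the fallback to ISO-8859-15 is an assumption.

```python
def guess_decode(data):
    """Decode bytes as UTF-8 if they form valid UTF-8, else as latin9.

    A heuristic only: text in a one-byte charset with accented letters
    is rarely valid UTF-8 by accident, so the first attempt usually
    fails cleanly for it.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-15")


word = "be\u00ebindigd"
from_utf8 = guess_decode(word.encode("utf-8"))
from_latin9 = guess_decode(word.encode("iso-8859-15"))
print(from_utf8, from_latin9)
```

Both inputs come out as the same text, which would explain a paste that
looks right even though the source terminal displayed it wrongly.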

The garbled display in the terminal itself is probably caused by mrxvt
not using Unicode even though LANG is set to a Unicode locale.
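
The mangling you describe (A with a tilde, then tiny "<<") is exactly what
you get when the UTF-8 bytes of "ë" are displayed as if they were a
one-byte charset. A small Python sketch to reproduce it, with variable
names of my own choosing:

```python
word = "be\u00ebindigd"

# Program writes UTF-8, terminal interprets the bytes as ISO-8859-15.
mangled = word.encode("utf-8").decode("iso-8859-15")

# Undoing the wrong interpretation recovers the original text.
repaired = mangled.encode("iso-8859-15").decode("utf-8")

print(mangled, repaired)
```

The two bytes of the UTF-8 sequence show up as two separate characters,
which is why one letter turns into two odd ones.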

> And sometimes the same thing happens in X programs (depending on....
> something I now perhaps have a hint about, but am not sure), the "Copy"
> command (in Dutch, "Kopiëren") appears with the same mangling of the ë
> character.

Then the font subsystem isn't prepared to receive UTF-8 at that point.
This is not locale-, but charset-related.

> But I've just noticed that when I tried to copy and paste the 'borked'
> output to text and then copy it to the compose window (which still
> pasted correctly, which was for once not what I wanted), that I used
> gnotepad+ (for speed) rather than gedit-- and gnotepad+ displays the
> lack of Unicode support as well (Kopiëren borked).

Text editors are another story, too. They might support different
output charsets for the files or they might not. In the latter case,
most of them will only use a one-byte encoding.

> Is that because it's a GTK+ 1 program? (That's really all it could be,
> seems to me.) GTK+-1.2.10-r11 is compiled with nls support, but that
> doesn't mean unicode support?
> 
> What a mess... does this mean I have to set GTK 1 somehow to use
> ISO-8859-15 and all the 'modern' programs to use UTF-8 as they do?

Hm, I have no idea whether this is GTK-version related. My guess would
be it isn't. It's probably only confusion between clipboards and
applications and their storage backends.

> Or give up/prune any GTK 1 programs I might use, and solve the problem
> that way?

Do you really need unicode, anyway?

> I mean, is it really so much to ask that accented and special characters
> appear correctly no matter what program I'm using? It's not like there's
> so many of them!!! But I have to tie my system in knots to get it?

On the console, the problem is mostly solved. X is a different beast,
though...

> How do you do it? I presume that the bulk of your system is displayed in
> German.

Nope. I found an English environment comfortable enough. At least (not
locale- but gettext-related) it has more up-to-date documentation, and
I can find all the error messages in the sources instead of in translated
message databases...

> Sorry, I'm getting a bit frazzled by this, and I'm annoyed because I
> don't think this should be a frazzle-worthy issue, but I've been
> struggling with it off and on for the past three-and-a-half years, and
> it's about time to get a handle on it.

Not an easy task. Maybe it's time to write an extensive article on this
for some Linux magazine... But one would have a hard time, given the
apparent lack of extensive documentation...


-hwh
