Re: [gentoo-user] Glibc, userlocales, and ENV Variables

2005-11-02 Thread Hans-Werner Hilse
Hi,

On Wed, 02 Nov 2005 15:53:11 +0100
Holly Bostick [EMAIL PROTECTED] wrote:

 [...]
 /etc/locales.build
 
 which says
 
 # This file names the list of locales to be built when glibc is installed.
 # The format is locale/charmap, where locale is a locale from the
 # /usr/share/i18n/locales directory, and charmap is name of one of the files
 # in /usr/share/i18n/charmaps/. All blank lines and lines starting with # are
 # ignored. Here is an example:
 # en_US/ISO-8859-1
 [...]
 Glibc built fine (afaict), but my problem is that I now don't know what
 to export with a LANG variable.
 
 For example, if I want [EMAIL PROTECTED]/UTF-8, how do I export that as
 opposed to [EMAIL PROTECTED]/ISO-8859-15 (or worse, ISO-8859-1)?

Note the comment you've cited: The format is locale/charmap. This
generates the locale data for a certain language (it's a little bit
more than just language, though) for the specified charmap.

In LANG/LC_* you only set the locale. The charmap is chosen
(semi-)automatically, which makes sense, since which charset is used
depends on the terminal.
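
As a rough illustration of the locale/charmap format that comment
describes, here is a minimal Python sketch that splits each entry into
its two parts. The function name and the sample text are made up for
illustration (the en_US.UTF-8/UTF-8 line is an assumed example, not
taken from anyone's actual file), and this is not how the glibc ebuild
itself reads the file.

```python
def parse_locales_build(text):
    """Return (locale, charmap) pairs from locales.build-style text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        # Everything before the first '/' is the locale, the rest the charmap.
        locale_name, sep, charmap = line.partition("/")
        if not sep or not locale_name or not charmap:
            raise ValueError(f"expected locale/charmap, got {line!r}")
        entries.append((locale_name, charmap))
    return entries


# Hypothetical sample content, for illustration only.
SAMPLE = """\
# example entries
en_US/ISO-8859-1

en_US.UTF-8/UTF-8
"""

for locale_name, charmap in parse_locales_build(SAMPLE):
    print(f"{locale_name} is built with charmap {charmap}")
```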
 
 Was I supposed to give the locales individual names as the Localization
 Guide implies? locales.build doesn't indicate that you can do that (and
 in fact, I thought perhaps the reason why language exports were mildly
 borked might be because I had done so).

[EMAIL PROTECTED]/ISO-8859-1 didn't make much sense to me (and maybe causes
some failures when building?), but other than that it seemed OK.

 Should I just get rid of the 'extra' locales (ISO-8859-15 and
 ISO-8859-1)? Since I guess I'm going to try to stick to UTF-8, maybe I
 don't really need them (I was mostly covering my butt, concerned that my
 current and future network connections might not support UTF-8, since
 they're mostly to Windows machines).

All the terminals you're using support UTF-8?

 I guess I've made a mistake, but I'm not quite sure what to do about it.
 Since fixing it will almost certainly require a recompile of glibc,
 and since compiling glibc takes nine-tenths of forever, I'd like to get
 on with it as soon as possible (sigh). So any hints would be appreciated.

How does the borkism of your locales manifest?


-hwh
-- 
gentoo-user@gentoo.org mailing list



Re: [gentoo-user] Glibc, userlocales, and ENV Variables

2005-11-02 Thread Holly Bostick
Hans-Werner Hilse schreef:
 Hi,
 
 On Wed, 02 Nov 2005 15:53:11 +0100 Holly Bostick [EMAIL PROTECTED] 
 wrote:
 
 
 [...] /etc/locales.build
 
 which says
 
 # This file names the list of locales to be built when glibc is installed.
 # The format is locale/charmap, where locale is a locale from the
 # /usr/share/i18n/locales directory, and charmap is name of one of the files
 # in /usr/share/i18n/charmaps/. All blank lines and lines starting with # are
 # ignored. Here is an example:
 # en_US/ISO-8859-1
 [...]
 Glibc built fine (afaict), but my problem is that I now don't know what
 to export with a LANG variable.
 
 For example, if I want [EMAIL PROTECTED]/UTF-8, how do I export that as 
 opposed to [EMAIL PROTECTED]/ISO-8859-15 (or worse, ISO-8859-1)?
 
 
 Note the comment you've cited: The format is locale/charmap. This 
 generates the locale data for a certain language (it's a little bit
  more than just language, though) for the specified charmap.
 
 In LANG/LC_* you only set the locale. The charmap is chosen
 (semi-)automatically, which makes sense, since which charset is used
 depends on the terminal.

OK, I kinda get that and dmesg says during boot that the terminal
(agetty) is being configured to use UTF-8 (which is what I told it to do
when I built the kernel, so that's OK).

So does that mean that when I log in to my DE/WM, and start X, the
charmap will be automatically UTF-8, because that's what the getty was?

I want the full ISO-8859-15 charset and the Euro symbol. UTF-8 gets me
the charset, but afaik I need some attachment to @euro to get the Euro
symbol (for those fonts that even have the character(s), which is
another horror show that I won't get into, since once you've found a
reasonably attractive font with all the characters, half the time it
doesn't have bold or italic or bold italic, so it's not very useful on
the desktop).

It's not clear to me whether the Euro symbol is included in UTF-8
encodings, or only as a special variant of ISO-8859-15 (the @euro
variant), which is one of the reasons I try to encode both.

 
 
 Was I supposed to give the locales individual names as the 
 Localization Guide implies? locales.build doesn't indicate that you
  can do that (and in fact, I thought perhaps the reason why 
 language exports were mildly borked might be because I had done 
 so).
 
 
 [EMAIL PROTECTED]/ISO-8859-1 didn't make much sense to me (and maybe causes
  some failures when building?), but other than that it seemed OK.

Well, of course I know less about this than you do, but my native Dutch
boyfriend runs an English Windows machine, I run Windows programs with
Wine, and about the only thing I think I know about the whole issue is
that Windows pretty much only knows ISO-8859-1 (unless you had a
multi-lingual version, which neither of us did). So I wanted support for
ISO-8859-1 to be available (with support for the Euro symbol for those
MS fonts that support it, which I think that the core MS fonts now do by
default, though I'm not sure about that either).

In any case, if such an application called for ISO-8859-1, I wanted it
to be there, though as you can tell, I don't get how this is all
supposed to work well enough to be sure that was the way to accomplish
the goal.

 
 
 Should I just get rid of the 'extra' locales (ISO-8859-15 and 
 ISO-8859-1)? Since I guess I'm going to try to stick to UTF-8, 
 maybe I don't really need them (I was mostly covering my butt, 
 concerned that my current and future network connections might not 
 support UTF-8, since they're mostly to Windows machines).
 
 
 All the terminals you're using support UTF-8?

Well, I thought so, but maybe I was wrong. I use mostly
multi-gnome-terminal (which does appear to have unicode support by
default), but when I switched window managers to fvwm-crystal, I started
using mrxvt and aterm a bit more (because fvwm-crystal likes them, and
xterm-- which crystal also likes-- takes forever to open for some
reason, likely unrelated but very annoying). This may well be when I
started noticing this as a problem rather than an annoyance, because I
was suddenly seeing it so much. Previously, the issue had only raised
its ugly head in some X programs, but not X programs I use that often,
so it was easy to ignore.

None of the terms I use have a unicode USE flag, but I have visited their
homepages. Now I see that support for CJK does not mean that UTF is
automatically supported; it seems that mrxvt does not support unicode,
nor do aterm/multi-aterm/rxvt.

OK, that answers that, I guess, but what did you Europeans do when
these terminals were all you had, for Pete's sake? Your output would
have been half-gibberish, and I don't see how people would have stood
for that.

 
 
 I guess I've made a mistake, but I'm not quite sure what to do 
 about it. Since fixing it will almost certainly require a 
 recompile of glibc, and since compiling glibc takes nine-tenths of 
 forever, I'd like to get it on 

Re: [gentoo-user] Glibc, userlocales, and ENV Variables

2005-11-02 Thread Hans-Werner Hilse
Hi,

On Wed, 02 Nov 2005 21:16:49 +0100
Holly Bostick [EMAIL PROTECTED] wrote:

 OK, I kinda get that and dmesg says during boot that the terminal
 (agetty) is being configured to use UTF-8 (which is what I told it to do
 when I built the kernel, so that's OK).

The kernel is configured by the Gentoo rc system using unicode_start.
This sets the console charmap and font.

 So does that mean that when I log in to my DE/WM, and start X, the
 charmap will be automatically UTF-8, because that's what the getty was?

No, that's independent. Your X terminal program talks to X and uses its
font subsystem. That also uses charset information for finding correct
fonts. On the other hand there are other means to use fonts now and
some solutions do an intermediate mapping to unicode.

 It's not clear to me whether the Euro symbol is included in UTF-8
 encodings, or only as a special variant of ISO-8859-15 (the @euro
 variant), which is one of the reasons I try to encode both.

It's in both UTF-8 (which includes every sign under the sun, almost)
and ISO-8859-15, which is the same as latin9 (hint: look for this when
searching for console fonts!) and is a slightly modified ISO-8859-1
a.k.a. latin1.
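
To make that concrete, here is a small Python sketch (just an
illustration using Python's built-in codecs, nothing glibc-specific)
showing how the Euro sign is encoded in each charset and that plain
ISO-8859-1 has no slot for it. The variable names are arbitrary.

```python
EURO = "\u20ac"  # EURO SIGN

# UTF-8 needs three bytes for the Euro sign.
utf8_bytes = EURO.encode("utf-8")
# ISO-8859-15 (latin9) stores it in a single byte.
latin9_bytes = EURO.encode("iso-8859-15")
print("UTF-8:      ", utf8_bytes.hex())
print("ISO-8859-15:", latin9_bytes.hex())

# ISO-8859-1 (latin1) cannot represent the Euro sign at all.
try:
    EURO.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print("ISO-8859-1 can encode it:", latin1_has_euro)

# The byte latin9 uses for the Euro is the generic currency sign in latin1.
same_byte_in_latin1 = latin9_bytes.decode("iso-8859-1")
print("Same byte read as latin1:", same_byte_in_latin1)
```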

  [EMAIL PROTECTED]/ISO-8859-1 didn't make much sense to me (and maybe causes
   some failures when building?), but other than that it seemed OK.
 
 Well, of course I know less about this than you do [...

Oooh, I've only known what I'm writing here for a few minutes, no
longer. In fact, the whole locale setup is terribly badly documented.

 ...], but my native Dutch boyfriend runs an English Windows machine, I 
 run Windows programs with Wine, and about the only thing I think I know 
 about the whole issue is that Windows pretty much only knows ISO-8859-1 
 (unless you had a multi-lingual version, which neither of us did). So I 
 wanted support for ISO-8859-1 to be available (with support for the Euro
 symbol for those MS fonts that support it, which I think that the core 
 MS fonts now do by default, though I'm not sure about that either).

Windows has used Unicode since 2000 (or even NT?). However, that doesn't
mean it ships fonts covering the full Unicode charset.

 In any case, if such an application called for ISO-8859-1, I wanted it
 to be there, though as you can tell, I don't get how this is all
 supposed to work well enough to be sure that was the way to accomplish
 the goal.

This whole business is very interesting. Actually, I'm just reading my
way through the glibc sources, as I've always been interested in this.
And it is _very_ badly documented. I've mentioned this.

In fact, there's no difference between the nl_NL and the [EMAIL PROTECTED]
locale. I think probably all of the @euro locales are more or less
obsolete now. I think they're a remnant from the time when the new
currency was introduced and the user had to choose. Now, the @euro
locales de facto just import their [EMAIL PROTECTED] counterparts. This is
written in the relevant changelog:

* locales/br_FR: Eliminate old national currencies of countries
  participating in Euro.  Make @euro files pure copies.
(continues for all @euro)

glibc's locale database has no separate .UTF-8 locale definition, so
the suffix must get stripped when the locale is generated.

To give you a hint, the default locales generated for language tag nl,
subtag NL are

nl_NL
[EMAIL PROTECTED]
nl_NL.UTF8

In your locales.build, this should read

nl_NL/ISO-8859-1
[EMAIL PROTECTED]/ISO-8859-15
nl_NL.UTF-8/UTF-8

to have the proper locale for each encoding you/your terminal may come
across. Although @euro is a copy, it is needed to identify and
distinguish each of the generated locales. After all, nl_NL... is just
a name; it could have been another name, too. But the LANG setting is
also used by gettext (which has nothing to do with all this, but has
the same author), AFAIK, and thus shouldn't be chosen totally arbitrarily.

Your LANG setting should be [EMAIL PROTECTED] for non-unicode environments
(given that you're using latin9/ISO-8859-15 fonts) and nl_NL.UTF-8 in
unicode environments.
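
To show how such a name breaks down into its parts, here is a minimal
Python sketch that splits a locale name of the usual
language_TERRITORY.codeset@modifier shape. The regular expression and
the function name are my own simplification, not what glibc or gettext
actually do internally, and de_DE@euro is only an illustrative name
that does not come from this thread.

```python
import re

# Simplified pattern: language[_TERRITORY][.codeset][@modifier]
LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def split_locale_name(name):
    """Split a locale name into its parts; missing parts are None."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognisable locale name: {name!r}")
    return match.groupdict()


# nl_NL and nl_NL.UTF-8 appear above; de_DE@euro is an illustrative name.
for name in ("nl_NL", "nl_NL.UTF-8", "de_DE@euro"):
    print(name, "->", split_locale_name(name))
```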

  How does the borkism of your locales manifest?
 
 Most of the time, when Dutch characters are meant to be used, they are,
 as in the following example:
 
 [EMAIL PROTECTED] - killall -9 conky
 conky: geen proces beëindigd
 
 but sometimes I get this:
 
  killall -9 MPlayer
 MPlayer: geen proces beëindigd
 
 ... now *that's* interesting... I copied and pasted the second from a
 terminal (mrxvt, whereas the first was from multi-gnome-terminal), where
 what appeared was
 
  killall -9 MPlayer
 MPlayer: geen proces beA(with the ~ over it)  (but tiny ones)indigd
 
 in place of the ë . But when I pasted it into this compose window, it
 came out right! But it isn't in the term.
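
The "A with a ~ over it" symptom is consistent with UTF-8 bytes being
displayed by a terminal that treats them as ISO-8859-1. The following
Python sketch reproduces that effect under exactly this assumption; it
says nothing about what mrxvt really does internally.

```python
word = "be\u00ebindigd"  # "beëindigd"

# What a UTF-8 locale writes to the terminal:
utf8_bytes = word.encode("utf-8")

# What a terminal shows if it interprets those bytes as ISO-8859-1:
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(as_latin1)

# The bytes themselves are intact, so decoding them as UTF-8 again
# (as a clipboard consumer might) recovers the original word.
recovered = as_latin1.encode("iso-8859-1").decode("utf-8")
print(recovered)
```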

This is probably due to different clipboard implementations. GTK has
its own clipboard which imports things from the X clipboard facility
that mrxvt is using. Probably GTK applies some logic like recognizing
multibyte sequences and decides that