[Materm-devel] First issue: guessing the locale encoding

Jehan Tue, 11 Dec 2007 11:48:06 -0800

Hello,

that's funny. A few days ago, I knew nothing about character encoding
and was lost like most people about this matter (the "magic" of "why
does it work now and not before?!"). And now... I still know very little,
but I am slightly less lost ;-). And I begin to understand some "notions".


So to explain what I came up with, I will be a little verbose, sorry,
because as a beginner I want the opinion of "experts". :-)
Moreover I will go point by point. This mail's point is simply the
guessing of the locale codeset.


---
the problem
---

I think the way we find the encoding is not good.

In encoding.c, the function rxvt_get_encoding_from_locale gets the locale
and takes the part of the string after the '.' to guess the encoding.

For instance: if LC_CTYPE=fr_FR.utf8 => codeset = utf8
This is wrong behaviour for several reasons.

---
the reasons
---

1/ there can be a modifier with '@' after the encoding.

See http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html

[EMAIL PROTECTED]

This is for instance the case with my own locale.
This is not the main problem, because it is easy to detect such
a '@' and take the substring between '.' and '@' instead. So it is
easy to fix.

2/ the locale naming may be different:

from the same link: "Additional criteria for determining a valid locale
name are implementation-dependent."
In some documents about locales on the net (among them some Debian manuals:
http://www.debian.org/doc/manuals/debian-euro-support/ch-configure.en.html),
I saw that aliases are sometimes defined for locales. All these documents
say that setting aliases for locales is not advised. But the fact is that
it is possible and should work, hence in my opinion we should not rely
on it merely being discouraged.

Moreover, when naming is implementation-dependent, trying to be
exhaustive about all possible names (so all possible implementations)
is bad because it is impossible. And even if it were possible, we
would still miss the users' customizations. So we must instead use
standardized tools with a common interface and output.

3/ the encoding name has many spellings

I made a test with:
$ localedef -i ja_JP -f EUC-JP ja_JP.EUC-JP
$ localedef -i en_US -f ISO-8859-15 en_US.ISO-8859-15
and then:
$ locale -a
...
en_US.iso885915
ja_JP.eucjp
...

So the localedef implementation on my platform transformed the names I
gave it!!!
In our mrxvt implementation, this still works because we use strcasecmp
(which ignores case) and because the array known_encodings tries to be
exhaustive with many possibilities: ISO8859-15, ISO885915 and
ISO_8859-15, as well as EUCJ, EUCJP, EUC-JP and UJIS. So this means a lot
of redundancy, a larger array to process, and more difficult maintenance
(if I modify one of the elements, I must not forget the other synonyms).

4/ even with this locale naming, the ".codeset" part is optional.
Many implementations won't specify the encoding in the name of the
locale; they will only specify it during the locale definition. For
instance, on my system, I have a locale created this way:

localedef -i fr_FR -f ISO-8859-15 [EMAIL PROTECTED]

- Conclusion:

So in my case, mrxvt will never guess the right encoding with
[EMAIL PROTECTED] (which respects the standard though): the codeset is
not in the locale name and the modifier '@euro' is present.

I always end up with the default codeset: ENC_NOENC! :-(

[note: This leads to DEFAULT_XFT_FONT_NAME = "Monospace" as the fallback
font (anyway all ISO-8859-X get this font, which I think is also wrong
behaviour, but is a different issue, for another email).]

---
Some clues
---

The locale command gives the right codeset:

$ [EMAIL PROTECTED] locale charmap
ISO-8859-15

$ LC_ALL=en_US.iso885915 locale charmap
ISO-8859-15

The advantage of this output? It is normalized! We don't have to handle
the cases where you get iso885915 or iso_8859-15 or any other "version"
with upper/lowercase, '-', '_', etc.

"locale charmap" always gives the same result (so no need for a redundant
known_encodings array, and I don't care about the modifications made by
localedef to my naming, nor about any personal alias made by a user).

And even with a wrong name, you get a valid codeset. I made some tests to
be sure. If your locale does not exist, you still get an existing
default codeset on your standard output:

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=fr_FR.utf8
LC_CTYPE="en_US.iso88591"
LC_NUMERIC="en_US.iso88591"
LC_TIME="en_US.iso88591"
LC_COLLATE="en_US.iso88591"
LC_MONETARY="en_US.iso88591"
LC_MESSAGES="en_US.iso88591"
LC_PAPER="en_US.iso88591"
LC_NAME="en_US.iso88591"
LC_ADDRESS="en_US.iso88591"
LC_TELEPHONE="en_US.iso88591"
LC_MEASUREMENT="en_US.iso88591"
LC_IDENTIFICATION="en_US.iso88591"
LC_ALL=en_US.iso88591

The errors seen here are because LC_ALL=en_US.iso88591 is NOT a
locale configured on my computer. Yet I have been able to assign it to
LC_ALL (which is just a variable after all!), but this does not
mean it has a meaning. And mrxvt in the current implementation will use
it (though it should not).
But the "locale" command knows this locale is not available.
So when I ask for the charmap, it gives a default one:

$ locale charmap
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
ANSI_X3.4-1968

(Note: only the string "ANSI_X3.4-1968" == ASCII is on the standard
output. The rest is on the error output. So this output could be used
from a program.)
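
As an illustration of that note, here is a minimal sketch of capturing
the standard output of "locale charmap" from C with popen(), while the
error output is discarded. The helper name read_locale_charmap is
hypothetical (it is not mrxvt code), and the sketch assumes a POSIX
system where the "locale" command is in the PATH.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/*
 * Run "locale charmap" and copy the first line of its standard output
 * into buf. Returns 0 on success, -1 on failure.
 * read_locale_charmap() is a hypothetical helper, not an mrxvt function.
 */
int read_locale_charmap(char *buf, size_t size)
{
    FILE *p;
    int ok;

    if (size == 0)
        return -1;

    /* Error messages go to stderr, so they are dropped here. */
    p = popen("locale charmap 2>/dev/null", "r");
    if (p == NULL)
        return -1;

    ok = fgets(buf, (int)size, p) != NULL;
    pclose(p);
    if (!ok)
        return -1;

    /* Strip the trailing newline printed by the command. */
    buf[strcspn(buf, "\n")] = '\0';
    return buf[0] != '\0' ? 0 : -1;
}
```

On success buf holds a string such as "ISO-8859-15". Spawning a shell
for this is heavy, so it is only meant to show that the output is
usable from a program.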

---
My solution
---

At first I could not find how to get this charmap from a C program,
other than spawning the "locale charmap" command. But this kind of
solution is never as good as calling a C function.
But thanks to the urxvt code, I found this function: nl_langinfo

http://www.opengroup.org/onlinepubs/009695399/functions/nl_langinfo.html

It is declared in the header langinfo.h, which is apparently part of the
POSIX specification. So it should be available on most systems.

nl_langinfo (CODESET) returns a char* string containing the exact same
result as "locale charmap" (see "man nl_langinfo"; this is stated there
and in other places).
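
A minimal sketch of such a call, assuming a POSIX system with
langinfo.h. The helper name locale_codeset is hypothetical. Note that
nl_langinfo reports on the program's current locale, so setlocale must
have been called first; otherwise the program is still in the "C" locale.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Return the codeset of the current LC_CTYPE locale, for example
 * "ISO-8859-15". locale_codeset() is a hypothetical helper name,
 * not an mrxvt function.
 */
const char *locale_codeset(void)
{
    /* Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG).
     * If this fails, the process stays in the "C" locale. */
    if (setlocale(LC_CTYPE, "") == NULL)
        fprintf(stderr, "warning: locale not supported, keeping \"C\"\n");

    /* The returned string may be overwritten by later calls to
     * setlocale() or nl_langinfo(); copy it if it must be kept. */
    return nl_langinfo(CODESET);
}
```

Printing the returned string from main should give the same text as
running "locale charmap" in the same environment.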

So I propose to use this instead of parsing the locale name. Anyway
the header is still included (but used nowhere, probably a leftover
from old code?) in rxvt.h:
#ifdef HAVE_NL_LANGINFO
# include <langinfo.h>
#endif

Of course, to be more portable, the best would be to do:
#ifdef HAVE_NL_LANGINFO
/* code with nl_langinfo (CODESET) */
#else
/* current code */
#endif
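
To make the skeleton above concrete, here is a hedged sketch. The
function names codeset_from_name and get_codeset are hypothetical, and
the fallback branch is not the current mrxvt code: it is a simplified
name parser that also stops at the '@' modifier, as discussed in
reason 1/.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#ifdef HAVE_NL_LANGINFO
# include <langinfo.h>
#endif

/*
 * Fallback: copy the text between '.' and an optional '@' of a locale
 * name into buf. Yields an empty string when the name has no
 * ".codeset" part. Hypothetical helper, not an mrxvt function.
 */
void codeset_from_name(const char *name, char *buf, size_t size)
{
    const char *dot, *end;
    size_t len = 0;

    if (size == 0)
        return;
    dot = (name != NULL) ? strchr(name, '.') : NULL;
    if (dot != NULL) {
        dot++;                          /* skip the '.' itself */
        end = strchr(dot, '@');         /* modifier, if any */
        len = (end != NULL) ? (size_t)(end - dot) : strlen(dot);
        if (len >= size)
            len = size - 1;             /* truncate to fit buf */
        memcpy(buf, dot, len);
    }
    buf[len] = '\0';
}

/* Store the codeset of the environment's locale in buf. */
void get_codeset(char *buf, size_t size)
{
    const char *locale = setlocale(LC_CTYPE, "");

    if (size == 0)
        return;
#ifdef HAVE_NL_LANGINFO
    (void)locale;                       /* the name is not needed here */
    strncpy(buf, nl_langinfo(CODESET), size - 1);
    buf[size - 1] = '\0';
#else
    codeset_from_name(locale, buf, size);
#endif
}
```

With the fallback, a name without a codeset part still gives an empty
string, which is exactly the case the nl_langinfo branch avoids.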

And in configure.ac, I added this (I don't understand all of it; I took it
from urxvt's configure.ac. But it should tell whether the function is
available, no?):

AC_CACHE_CHECK([for working nl_langinfo], [rxvt_cv_func_nl_langinfo],
  [AC_LINK_IFELSE(
     [AC_LANG_PROGRAM([[#include <langinfo.h>]],
                      [[nl_langinfo(CODESET);]])],
     [rxvt_cv_func_nl_langinfo=yes],
     [rxvt_cv_func_nl_langinfo=no])])
if test "x$rxvt_cv_func_nl_langinfo" = xyes; then
  AC_DEFINE([HAVE_NL_LANGINFO], [1], [Define if nl_langinfo(CODESET) works])
fi

If this is OK for you, I propose to commit this to svn (it is already
modified locally on my computer).
The changes work in my case:

$ locale
LANG=fr_FR.utf8
LC_CTYPE="[EMAIL PROTECTED]"
LC_NUMERIC="[EMAIL PROTECTED]"
LC_TIME="[EMAIL PROTECTED]"
LC_COLLATE="[EMAIL PROTECTED]"
LC_MONETARY="[EMAIL PROTECTED]"
LC_MESSAGES="[EMAIL PROTECTED]"
LC_PAPER="[EMAIL PROTECTED]"
LC_NAME="[EMAIL PROTECTED]"
LC_ADDRESS="[EMAIL PROTECTED]"
LC_TELEPHONE="[EMAIL PROTECTED]"
LC_MEASUREMENT="[EMAIL PROTECTED]"
LC_IDENTIFICATION="[EMAIL PROTECTED]"
[EMAIL PROTECTED]

$ ./mrxvt -dlevel debug -dmask encoding [before the change]
Debug mask: 0x00002000, debug level: 5
set default locale to [EMAIL PROTECTED]
set multichar encoding to noenc
rxvt_set_default_font_x11
...

$ ./mrxvt -dlevel debug -dmask encoding [after the change]
Debug mask: 0x00002000, debug level: 5
set default locale to [EMAIL PROTECTED]
set multichar encoding to ISO-8859-15
rxvt_set_default_font_x11
...


Note: I saw that rxvt_get_encoding_from_locale is called only from
rxvt_extract_resources in xdefaults.c (which then calls
rxvt_set_multichar_encoding from encoding.c, which sets
r->h->encoding_method by comparing the resulting string with the
known_encodings struct), and only if the macro MULTICHAR_SET is defined. I
have not looked at what happens (and how the locale is set) when this
macro is not defined.
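
As a side illustration of the comparison step mentioned in this note,
here is a hedged sketch of a lookup that normalizes the name before
comparing, so the table needs one key per spelling family instead of
every variant. All identifiers below (sketch_encoding, sketch_encodings,
normalize_codeset, lookup_encoding) are hypothetical; only the idea of
a known_encodings table comes from mrxvt.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding identifiers for this sketch only. */
enum sketch_encoding { SK_NOENC, SK_ISO8859_15, SK_EUCJ, SK_UTF8 };

struct sketch_entry {
    const char *key;              /* normalized: lowercase, alnum only */
    enum sketch_encoding enc;
};

static const struct sketch_entry sketch_encodings[] = {
    { "iso885915", SK_ISO8859_15 },
    { "eucj",      SK_EUCJ },
    { "eucjp",     SK_EUCJ },
    { "ujis",      SK_EUCJ },
    { "utf8",      SK_UTF8 },
};

/* Lowercase name and drop everything that is not a letter or digit,
 * so "ISO_8859-15" and "iso885915" both become "iso885915". */
void normalize_codeset(const char *name, char *out, size_t size)
{
    size_t n = 0;

    if (size == 0)
        return;
    for (; *name != '\0' && n + 1 < size; name++) {
        if (isalnum((unsigned char)*name))
            out[n++] = (char)tolower((unsigned char)*name);
    }
    out[n] = '\0';
}

/* Map a codeset name to an identifier, SK_NOENC if it is unknown. */
enum sketch_encoding lookup_encoding(const char *name)
{
    char key[64];
    size_t i;

    normalize_codeset(name, key, sizeof key);
    for (i = 0; i < sizeof sketch_encodings / sizeof sketch_encodings[0]; i++) {
        if (strcmp(key, sketch_encodings[i].key) == 0)
            return sketch_encodings[i].enc;
    }
    return SK_NOENC;
}
```

This is only one possible design. It would also accept the output of
nl_langinfo (CODESET) as input, whatever spelling the platform uses.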

I have seen many other things that look problematic, but this mail is
already too long. So they are for further mails. :-)

Jey

_______________________________________________
Materm-devel mailing list
Materm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/materm-devel
mrxvt home page: http://materm.sourceforge.net
