Hello, that's funny. A few days ago I knew nothing about character encoding and was as lost as most people on this matter (the "magic" of "why does it work now and not before?!"). And now... I still know very little, but I am slightly less lost ;-). And I am beginning to understand some of the "notions".

So, to explain what I came up with, I will be a little verbose, sorry: as a beginner I want the opinion of "experts". :-) I will also go point by point. The point of this mail is simply how the locale codeset is guessed.

--- the problem ---

I think the way we find the encoding is not good. In encoding.c, the function rxvt_get_encoding_from_locale gets the locale and takes the part of the string after the '.' to guess the encoding. For instance:

    if LC_CTYPE=fr_FR.utf8 => codeset = utf8

This behaviour is wrong for several reasons.

--- the reasons ---

1/ there can be a modifier introduced by '@' after the encoding.

See http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html
[EMAIL PROTECTED]
This is, for instance, the case for my own locale. It is not the main problem, because it is easy to detect such a '@' and take the substring between '.' and '@' instead. So easy to fix.

2/ the locale naming may be different

From the same link: "Additional criteria for determining a valid locale name are implementation-dependent."
In some documents about locales on the net (among them a Debian manual: http://www.debian.org/doc/manuals/debian-euro-support/ch-configure.en.html), I saw that aliases are sometimes defined for locales. All of these documents say that setting aliases for locales is not advised. But the fact is that it is possible and should work, so in my opinion we should not rely on the fact that it is not recommended.
Moreover, when naming is implementation-dependent, trying to be exhaustive about all possible names (so all possible implementations) is bad, because it is impossible. And even if it were possible, we would still miss the users' customizations. So we must use standardized tools with a common interface and output instead.

3/ the encoding has many spellings

I made a test with:

    $ localedef -i ja_JP -f EUC-JP ja_JP.EUC-JP
    $ localedef -i en_US -f ISO-8859-15 en_US.ISO-8859-15

and then:

    $ locale -a
    ...
    en_US.iso885915
    ja_JP.eucjp
    ...

So the localedef implementation on my platform transformed the names I gave it!!! In our mrxvt implementation this still works, because we use strcasecmp (which ignores case) and because the known_encodings array tries to be exhaustive, with many variants: ISO8859-15, ISO885915 and ISO_8859-15, as well as EUCJ, EUCJP, EUC-JP and UJIS. That means a lot of redundancy, a larger array to process, and more maintenance work (if I modify one of the elements, I must not forget its synonyms).

4/ even with this locale naming, the ".codeset" part is optional.

Many implementations do not state the encoding in the name of the locale; they only specify it when the locale is defined. For instance, on my system, I have a locale created this way:

    localedef -i fr_FR -f ISO-8859-15 [EMAIL PROTECTED]

- Conclusion:
So in my case, mrxvt will never guess the right encoding with [EMAIL PROTECTED] (which follows the standard, though): the codeset is not in the locale name, and the modifier '@euro' is present. I always end up with the default codeset: ENC_NOENC! :-(
[note: this leads to DEFAULT_XFT_FONT_NAME = "Monospace" as the fallback font. Anyway, all ISO-8859-X encodings get this font, which I think is also wrong behaviour, but that is a different issue, for another email.]

--- Some clues ---

The locale command gives the right codeset:

    $ [EMAIL PROTECTED] locale charmap
    ISO-8859-15
    $ LC_ALL=en_US.iso885915 locale charmap
    ISO-8859-15

The advantage of this output? It is normalized! We don't have to handle the cases where you get iso885915 or iso_8859-15 or any other "version" with upper/lowercase, '-', '_', etc. "locale charmap" always gives the same result (so there is no need for a redundant known_encodings array, and I don't have to care about the changes localedef made to my naming, nor about any personal alias made by a user).

And even with a wrong name, you get a usable codeset. I ran some tests to be sure.

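Going back to reasons 1/ and 4/ for a moment, here is a minimal sketch of the name-based approach with the '@' fix applied. The helper name codeset_from_locale_name and its signature are hypothetical (this is not the actual code of rxvt_get_encoding_from_locale), and the sketch assumes names of the form language[_territory][.codeset][@modifier].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only (not the mrxvt code):
 * copy the codeset part of a locale name of the assumed form
 * language[_territory][.codeset][@modifier] into out.
 * Returns 1 if a non-empty codeset was found and fits in out, 0 otherwise.
 */
static int
codeset_from_locale_name(const char *name, char *out, size_t out_size)
{
    const char *dot, *end;
    size_t len;

    assert(name != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    dot = strchr(name, '.');
    if (dot == NULL)
        return 0;                /* reason 4/: no ".codeset" in the name */

    end = strchr(dot + 1, '@');  /* reason 1/: stop at the modifier, if any */
    if (end == NULL)
        end = dot + strlen(dot); /* no modifier: codeset runs to the end */

    len = (size_t)(end - (dot + 1));
    if (len == 0 || len >= out_size)
        return 0;

    memcpy(out, dot + 1, len);
    out[len] = '\0';
    return 1;
}
```

With a made-up name such as "de_DE@euro" the function returns 0, which is the situation that ends in ENC_NOENC: the '@' fix alone cannot recover a codeset that is simply not in the name.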
If your locale does not exist, you still get an existing default codeset on standard output:

    $ locale
    locale: Cannot set LC_CTYPE to default locale: No such file or directory
    locale: Cannot set LC_MESSAGES to default locale: No such file or directory
    locale: Cannot set LC_ALL to default locale: No such file or directory
    LANG=fr_FR.utf8
    LC_CTYPE="en_US.iso88591"
    LC_NUMERIC="en_US.iso88591"
    LC_TIME="en_US.iso88591"
    LC_COLLATE="en_US.iso88591"
    LC_MONETARY="en_US.iso88591"
    LC_MESSAGES="en_US.iso88591"
    LC_PAPER="en_US.iso88591"
    LC_NAME="en_US.iso88591"
    LC_ADDRESS="en_US.iso88591"
    LC_TELEPHONE="en_US.iso88591"
    LC_MEASUREMENT="en_US.iso88591"
    LC_IDENTIFICATION="en_US.iso88591"
    LC_ALL=en_US.iso88591

The errors seen here are because LC_ALL=en_US.iso88591 is NOT a locale configured on my computer. Yet I was able to assign it to LC_ALL (which is just a variable after all!), but that does not mean it has a meaning. And mrxvt in its current implementation will use it (though it should not). The "locale" command, however, knows that this locale cannot be used. So when I ask for the charmap, it gives a default one:

    $ locale charmap
    locale: Cannot set LC_CTYPE to default locale: No such file or directory
    locale: Cannot set LC_MESSAGES to default locale: No such file or directory
    locale: Cannot set LC_ALL to default locale: No such file or directory
    ANSI_X3.4-1968

(Note: only the string "ANSI_X3.4-1968" == ASCII goes to standard output. The rest goes to standard error. So this output could be used from a program.)

--- My solution ---

At first I could not find how to get this charmap from a C program, other than spawning the command "locale charmap". But that kind of solution is never as good as calling a C function. Thanks to the urxvt code, I found this function: nl_langinfo
http://www.opengroup.org/onlinepubs/009695399/functions/nl_langinfo.html
It is declared in the include file langinfo.h, which is apparently part of the POSIX specification.

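A minimal sketch of how the call could look, assuming the program has already called setlocale(LC_CTYPE, "") so that the locale from the environment is active (otherwise it is still in the "C" locale). The helper name codeset_from_langinfo and its signature are hypothetical; only nl_langinfo and CODESET come from langinfo.h.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only (not from mrxvt or urxvt):
 * copy the codeset name of the current LC_CTYPE locale into out.
 * Assumes the caller has already selected the locale with setlocale().
 * Returns 1 on success, 0 if the name is empty or does not fit in out.
 */
static int
codeset_from_langinfo(char *out, size_t out_size)
{
    const char *codeset;
    size_t len;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || codeset[0] == '\0')
        return 0;

    len = strlen(codeset);
    if (len >= out_size)
        return 0;

    /* Copy at once: later setlocale()/nl_langinfo() calls may overwrite it. */
    memcpy(out, codeset, len + 1);
    return 1;
}
```

The copied name could then be matched against known_encodings in the same way as the name parsed from the locale is today.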
Since langinfo.h is part of POSIX, nl_langinfo should be available on most systems. nl_langinfo (CODESET) returns a char* string that is guaranteed to be exactly the same as the output of "locale charmap" (see "man nl_langinfo"; this is stated there, among other places). So I propose to use this instead of parsing the locale name. The header is already included anyway (but used nowhere, probably a leftover from old code?) in rxvt.h:

    #ifdef HAVE_NL_LANGINFO
    # include <langinfo.h>
    #endif

Of course, to be more portable, the best would be to do:

    #ifdef HAVE_NL_LANGINFO
    // code with nl_langinfo (CODESET)
    #else
    // current code
    #endif

And in configure.ac I added this (I don't understand all of it, I took it from urxvt's configure.ac; but it should tell us whether the function is available, no?):

    AC_CACHE_CHECK(for working nl_langinfo, rxvt_cv_func_nl_langinfo,
    [AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <langinfo.h>]], [[nl_langinfo(CODESET);]])],[rxvt_cv_func_nl_langinfo=yes],[rxvt_cv_func_nl_langinfo=no])])
    if test x$rxvt_cv_func_nl_langinfo = xyes; then
      AC_DEFINE(HAVE_NL_LANGINFO, 1, Define if nl_langinfo(CODESET) works)
    fi

If this is OK for you, I propose to commit this to svn (it is already modified locally on my computer). The changes work in my case:

    $ locale
    LANG=fr_FR.utf8
    LC_CTYPE="[EMAIL PROTECTED]"
    LC_NUMERIC="[EMAIL PROTECTED]"
    LC_TIME="[EMAIL PROTECTED]"
    LC_COLLATE="[EMAIL PROTECTED]"
    LC_MONETARY="[EMAIL PROTECTED]"
    LC_MESSAGES="[EMAIL PROTECTED]"
    LC_PAPER="[EMAIL PROTECTED]"
    LC_NAME="[EMAIL PROTECTED]"
    LC_ADDRESS="[EMAIL PROTECTED]"
    LC_TELEPHONE="[EMAIL PROTECTED]"
    LC_MEASUREMENT="[EMAIL PROTECTED]"
    LC_IDENTIFICATION="[EMAIL PROTECTED]"
    [EMAIL PROTECTED]

    $ ./mrxvt -dlevel debug -dmask encoding    [before the change]
    Debug mask: 0x00002000, debug level: 5
    set default locale to [EMAIL PROTECTED]
    set multichar encoding to noenc
    rxvt_set_default_font_x11
    ...

    $ ./mrxvt -dlevel debug -dmask encoding    [after the change]
    Debug mask: 0x00002000, debug level: 5
    set default locale to [EMAIL PROTECTED]
    set multichar encoding to ISO-8859-15
    rxvt_set_default_font_x11
    ...

Note: I saw that rxvt_get_encoding_from_locale is called only from rxvt_extract_resources in xdefaults.c (which then calls rxvt_set_multichar_encoding from encoding.c, which sets r->h->encoding_method by comparing the resulting string with the known_encodings struct), and only if the macro MULTICHAR_SET is defined. I have not looked at what happens (and how the locale is set) when this macro is not defined.

I have seen many other things that look problematic, but this mail is already too long. So, for further mails. :-)

Jey