Hello,
The C library function setlocale is used for changing locales via
categories like LC_CTYPE, LC_COLLATE, etc. So when we wish to convert a
multibyte UTF-8 string to a wide character, we can use functions like
mbtowc. But this conversion is based on the value of LC_CTYPE, which can
be set with the setlocale function.
Consider a scenario where we have multiple threads, each working in a
different locale. Then setlocale will affect the other threads too,
because it changes the categories globally.
To avoid this we can use the C++ standard library's <locale> header,
whose contents live in namespace std. Here we have various built-in
classes like locale, collate, codecvt, etc.
I would like to know if anybody on this mailing list is familiar with
this.
The collate template class has some member functions like do_compare,
do_transform, etc. There is very little information in the man pages
regarding the usage of these functions. Also, how do we specify the type
of collation we need when we are to compare two strings, and how is the
information about the locale passed to this do_compare function?
I have not encountered any function that finds the length in characters
of a multibyte string. Of course we have mblen to do this, but it again
depends on the LC_CTYPE category, and changing that would affect other
threads too. One way to overcome this is to use the encoding scheme
directly to find the length. Given below is a snippet of code that
validates an incoming UTF-8 string and finds its length:
status_t
utf8_cs::csm_strlen ( char const *instr,
                      int *outlen ) const
{
    status_t st = STATUS_OK ;
    int index = 0 ;
    *outlen = 0 ;
    int length_in_bytes = strlen(instr) ;

    while (st == STATUS_OK && index < length_in_bytes)
    {
        unsigned char lead = instr[index] ;

        if (!(lead & 0x80))              /* 1 byte: 0xxxxxxx */
        {
            (*outlen)++ ;
            index++ ;
        }
        else if ((lead & 0xe0) == 0xc0)  /* 2 bytes: 110xxxxx 10xxxxxx */
        {
            if ((instr[index + 1] & 0xc0) == 0x80)
            {
                (*outlen)++ ;
                index += 2 ;
            }
            else
                st = csm_error(SQL_ERR_INV_CHARACTERS) ;
        }
        else if ((lead & 0xf0) == 0xe0)  /* 3 bytes: 1110xxxx 10xxxxxx
                                                     10xxxxxx */
        {
            if (((instr[index + 1] & 0xc0) == 0x80) &&
                ((instr[index + 2] & 0xc0) == 0x80))
            {
                (*outlen)++ ;
                index += 3 ;
            }
            else
                st = csm_error(SQL_ERR_INV_CHARACTERS) ;
        }
        else
        {
            /* none of the above patterns matched, so this is not a
               valid UTF-8 lead byte (note: 4-byte sequences for
               characters above U+FFFF are not handled here) */
            st = csm_error(SQL_ERR_INV_CHARACTERS) ;
        }
    } /* end of while loop */

    return st ;
}
But this will again depend on the underlying processor: whether it is
little endian or big endian, and whether the execution character set is
ASCII-based or EBCDIC. Can anyone see a way to tackle this problem?
Also, can anyone who is familiar with the codecvt template class defined
in <locale> in namespace std please provide me with some sample code
that converts between two different character sets?
In the directory /usr/lib/locale we have different directories named
after different locales, right?
In each of these we have some more directories like LC_COLLATE,
LC_CTYPE, etc. But unlike the en_US.UTF-8 directory, where we have all
the categories, i.e. LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY,
LC_NUMERIC, LC_TIME, LO_LTYPE, in other directories like fr.UTF-8 we
have only LC_MESSAGES. What does this mean?
UTF-8 is supposed to cover the characters of all the major languages
used in the world today. Then what is the purpose of having different
locales like de.UTF-8, de.UTF-8@euro, en_US.UTF-8, es.UTF-8,
es.UTF-8@euro, fr.UTF-8, fr.UTF-8@euro, it.UTF-8, it.UTF-8@euro,
ja_JP.UTF-8, ko.UTF-8, sv.UTF-8, sv.UTF-8@euro, etc.? What does this
mean?
Suppose we were to construct locale objects in this manner:
locale my_loc("en_US.UTF-8") ;
locale my_loc1("fr.UTF-8") ;
What difference does it make? If these are supposed to be different
locales, we will have to shift between locales whenever we need to. So
is there something like a locale called just UTF-8 which would have all
the required characters, instead of having something like
country.UTF-8?
As I mentioned above, there are directories like en_US.UTF-8 which in
turn have directories like LC_CTYPE, LC_COLLATE, etc. These directories
(e.g. LC_COLLATE) are empty. Is it something where we have to fill the
required collation sequences into files and write them to this
directory?
We are actually working on an SQL tool, the code of which is being
written in C++, and we are in the middle of an i18n
(internationalisation) process. We are to implement a UTF-8 character
set in our code, which is supposed to be a superset of all major
character sets. In a threaded model of the SQL tool, different users
will be working in different threads, and this is where thread safety
is required.
Could anyone help with some of the above unresolved issues?
Regards,
Jeu
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/