Daniel Resare writes:
> I need to write a program that reads information from a file in UTF-8
> and displays it according to the current locale in a highly portable
> fashion. To my understanding there are three ways to do this, and I would
> be delighted to get some input on which one is the most portable and
> flexible.
>
> 1) use setlocale() to set a UTF-8 locale and then use mbsrtowcs() to
> convert the string to wchar_t[] and print out with wprintf().
> Problems:
> * To determine a locale (if any) that is UTF-8 enabled and.
> * other threads using LC_CTYPE dependent functions might break.
This is totally unportable; it works only with glibc. Between
converting the string to wchar_t[] and printing it with wprintf() you
would have to switch the locale back to the original one (otherwise
you could equally well printf() the UTF-8 string). That switch makes
all wchar_t[] strings in memory invalid, because the encoding of
wchar_t is locale dependent.
> 2) convert the file input to wchar_t using iconv() and print out with
> wprintf().
> Problems:
> * To my understanding there is not much specified about the wchar_t type,
> so a program converting to it would need to make some assumptions that
> might not be very portable. (I.e. casting an UCS-4 char* to wchar_t* will
> work) The __STDC_ISO_10646__ macro can be of some help when detecting truly
> wicked systems, but no robust solution seems to exist.
This is better but still not fully portable: not all iconv
implementations can convert to or from "wchar_t" yet. Only glibc and
libiconv can.
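
Whether a given iconv supports this can be probed at run time, since
iconv_open() fails when it does not know an encoding name. A minimal
sketch, assuming the name "WCHAR_T" is accepted (glibc and libiconv
are believed to match names case-insensitively); utf8_to_wchar is a
hypothetical helper, not part of any library:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 into `out`, which has
   room for `cap` wide characters including the terminator.  Returns the
   number of wide characters written, or (size_t)-1 if this iconv cannot
   convert to "WCHAR_T" or the conversion itself fails. */
size_t utf8_to_wchar(const char *utf8, wchar_t *out, size_t cap)
{
    assert(utf8 != NULL && out != NULL && cap > 0);

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* encoding pair not supported here */

    char *in = (char *)utf8;        /* iconv() takes char **; input is not modified */
    size_t inleft = strlen(utf8);
    char *outp = (char *)out;
    size_t outleft = (cap - 1) * sizeof(wchar_t);   /* keep room for L'\0' */

    size_t r = iconv(cd, &in, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;          /* invalid input or buffer too small */

    size_t n = (size_t)(outp - (char *)out) / sizeof(wchar_t);
    out[n] = L'\0';
    return n;
}
```

A caller that gets (size_t)-1 back has to fall back to another
approach, which is exactly the portability problem described above.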
> 3) convert the file input directly to the output charset as found out by
> querying OUTPUT_CHARSET and nl_langinfo(CODESET) and write it out using
> standard printf().
This is the most portable option. Forget about OUTPUT_CHARSET; it is
not documented anywhere. Only nl_langinfo(CODESET) is documented and
standardized. On platforms where nl_langinfo(CODESET) is not
available, libiconv has a substitute.
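
A minimal sketch of this approach, assuming a POSIX system with
iconv() and nl_langinfo(), and assuming the locale's encoding is an
ordinary char-based one so the result can be NUL-terminated. The names
utf8_to_codeset and print_utf8 are hypothetical helpers:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to `to_codeset`.
   Returns a malloc()ed NUL-terminated string, or NULL on any failure. */
char *utf8_to_codeset(const char *utf8, const char *to_codeset)
{
    assert(utf8 != NULL && to_codeset != NULL);

    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    char *in = (char *)utf8;        /* iconv() takes char **; input is not modified */
    size_t inleft = strlen(utf8);
    size_t cap = inleft + 16;       /* initial guess, doubled on E2BIG */
    size_t used = 0;
    int flushing = 0;               /* second phase: emit any final shift sequence */
    char *out = malloc(cap);

    while (out != NULL) {
        char *outp = out + used;
        size_t outleft = cap - used - 1;    /* keep room for the NUL */
        size_t r = flushing ? iconv(cd, NULL, NULL, &outp, &outleft)
                            : iconv(cd, &in, &inleft, &outp, &outleft);
        used = (size_t)(outp - out);

        if (r != (size_t)-1) {
            if (flushing)
                break;              /* all input converted and flushed */
            flushing = 1;
        } else if (errno == E2BIG) {
            cap *= 2;
            char *bigger = realloc(out, cap);
            if (bigger == NULL)
                free(out);
            out = bigger;
        } else {                    /* EILSEQ or EINVAL: give up */
            free(out);
            out = NULL;
        }
    }

    iconv_close(cd);
    if (out != NULL)
        out[used] = '\0';
    return out;
}

/* Print UTF-8 text in the current locale's encoding; 0 on success. */
int print_utf8(const char *utf8)
{
    char *s = utf8_to_codeset(utf8, nl_langinfo(CODESET));
    if (s == NULL)
        return -1;
    fputs(s, stdout);
    free(s);
    return 0;
}
```

The program should call setlocale(LC_ALL, "") once at startup;
without that, nl_langinfo(CODESET) describes the "C" locale rather
than the user's.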
> Problems:
> * Is OUTPUT_CHARSET a gnu extension, or part of some standard?
OUTPUT_CHARSET is a glibc-specific hack.
> * You lose all useful wchar.h functions in libc.
You can still use these functions after converting the string with
mbstowcs().
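
A minimal sketch of that, assuming the string is already in the
current locale's encoding and that mbstowcs() accepts a NULL
destination to report the length (POSIX behaviour); count_upper is a
hypothetical helper chosen only to show a wctype.h function in use:

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the uppercase characters in a string that is
   encoded in the current locale's charset.  Returns (size_t)-1 on failure. */
size_t count_upper(const char *s)
{
    assert(s != NULL);

    /* With a NULL destination, POSIX mbstowcs() returns the length in
       wide characters without storing anything. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return (size_t)-1;          /* invalid multibyte sequence */

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return (size_t)-1;
    mbstowcs(w, s, n + 1);          /* same locale as above, so this succeeds */

    size_t upper = 0;
    for (size_t i = 0; i < n; i++)
        if (iswupper((wint_t)w[i]))
            upper++;

    free(w);
    return upper;
}
```

The wide string never outlives a locale change here, which avoids the
problem described for approach 1.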
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/