On Tue, 5 Sep 2000 19:40:17 +0200 (CEST), Bruno Haible <[EMAIL PROTECTED]> writes:

> If your internal representation of text is Unicode, then why do you
> bother with wchar_t[] at all?

To provide a way to interface to C libraries which talk in terms of
wchar_t[]. (I haven't seen such a library yet, except libc, but maybe
they will become more popular?)
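
The libc calls I have in mind are things like wcscoll() or wcsftime().
A toy sketch of the kind of glue I mean (it silently assumes that
wchar_t holds Unicode code points, which is true on glibc but
guaranteed nowhere else; that assumption is itself part of the
problem):

    #include <wchar.h>

    /* Toy example: compare two strings kept internally as arrays of
       Unicode code points, through the wchar_t-based libc API.  It
       assumes wchar_t holds Unicode code points (glibc only). */
    int compare_unicode(const unsigned int *a, size_t alen,
                        const unsigned int *b, size_t blen)
    {
        wchar_t wa[256], wb[256];   /* no bounds checking: sketch only */
        size_t i;

        for (i = 0; i < alen && i < 255; i++) wa[i] = (wchar_t)a[i];
        wa[i] = L'\0';
        for (i = 0; i < blen && i < 255; i++) wb[i] = (wchar_t)b[i];
        wb[i] = L'\0';

        return wcscoll(wa, wb);     /* locale-aware comparison from libc */
    }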

> For the conversions you can use iconv() and a normalizing wrapper
> around nl_langinfo(CODESET).

I don't like the idea of applications having to find and interpret
locale.aliases themselves...
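
Roughly, that suggestion amounts to something like the following; the
normalization table is made up here, and covering every platform's
spellings is exactly the work I don't want each application to redo:

    #include <langinfo.h>
    #include <string.h>
    #include <iconv.h>

    /* Hypothetical normalization of nl_langinfo(CODESET) output.  The
       table is invented; a real one would have to know every
       platform's spelling. */
    static const char *normalize_codeset(const char *cs)
    {
        if (strcmp(cs, "ANSI_X3.4-1968") == 0) return "US-ASCII";
        if (strcmp(cs, "646") == 0)            return "US-ASCII";
        return cs;   /* hope it is already a name this iconv accepts */
    }

    /* Open a converter from the locale's charset to UTF-8.  Assumes
       the application has already called setlocale(LC_CTYPE, ""). */
    iconv_t open_locale_to_utf8(void)
    {
        return iconv_open("UTF-8",
                          normalize_codeset(nl_langinfo(CODESET)));
    }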

> In glibc 2.1.93 it does: use iconv with "wchar_t" argument. It also
> knows about "UCS-4" and "UCS-4LE" encodings.

Good, so there is a chance that a future iconv will be more usable
(that's a development version of glibc, isn't it? so it's still the
future for me).
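
If the "wchar_t" name really works there, the conversion I need would
look roughly like this (untested; the "wchar_t" encoding name is the
glibc-specific part, and error handling is minimal):

    #include <iconv.h>
    #include <wchar.h>

    /* Convert a UTF-8 buffer into wchar_t[] via the glibc-specific
       "wchar_t" encoding name.  Returns the number of wide characters
       written, or (size_t)-1 on error. */
    size_t utf8_to_wchar(const char *in, size_t inlen,
                         wchar_t *out, size_t outlen)
    {
        iconv_t cd = iconv_open("wchar_t", "UTF-8");
        char *inp, *outp;
        size_t inleft, outleft, r;

        if (cd == (iconv_t)-1)
            return (size_t)-1;
        inp = (char *)in;
        outp = (char *)out;
        inleft = inlen;
        outleft = outlen * sizeof(wchar_t);
        r = iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);
        if (r == (size_t)-1)
            return (size_t)-1;
        return (outlen * sizeof(wchar_t) - outleft) / sizeof(wchar_t);
    }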

> Which limitations does the portable iconv substitute (libiconv) have?

That it must be carried along with a package: it's not a tiny wrapper
around what the OS and standard libc provide, but a whole
implementation from scratch.

And that (for iconv in general) there is no nice way to determine
either the name of the default locale's encoding or the name of a
Unicode encoding that is known to be supported. It all looks like
kludges and guessing...
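
The best I can come up with for the Unicode side is probing candidate
names until one opens, which is exactly the kind of guessing I mean
(the candidate list below is ad hoc and certainly incomplete):

    #include <iconv.h>

    /* Guess a spelling of UCS-4 that the local iconv accepts by
       trying a few candidates.  Other systems may need other
       spellings, and that is the problem. */
    const char *find_ucs4_name(void)
    {
        static const char *candidates[] = {
            "UCS-4", "UCS-4LE", "UCS4", "ucs4", "ISO-10646-UCS-4", 0
        };
        int i;

        for (i = 0; candidates[i] != 0; i++) {
            iconv_t cd = iconv_open(candidates[i], "UTF-8");
            if (cd != (iconv_t)-1) {
                iconv_close(cd);
                return candidates[i];
            }
        }
        return 0;   /* no luck */
    }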

How does one determine the quality of a locally installed iconv? For
example, I don't consider the one in glibc-2.1.3 usable: recently I've
seen "../iconv/skeleton.c:324: __gconv_transform_utf8_internal:
Assertion `nstatus == GCONV_FULL_OUTPUT' failed.", there are several
other errors, and the checking for illegal UTF-8 is poor.
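
About the only run-time check I can imagine is feeding iconv something
known to be illegal and seeing whether it complains. A conforming
converter should reject an overlong sequence with EILSEQ, but that is
precisely what the poor implementations get wrong (and "UCS-4" as the
target name is itself only a guess, as above):

    #include <iconv.h>
    #include <errno.h>
    #include <stddef.h>

    /* Crude run-time sanity check: a conforming converter should
       reject the overlong UTF-8 encoding of '/' (0xC0 0xAF) with
       EILSEQ. */
    int utf8_checking_looks_sane(void)
    {
        iconv_t cd = iconv_open("UCS-4", "UTF-8");
        char inbuf[2], outbuf[16];
        char *inp = inbuf, *outp = outbuf;
        size_t inleft = sizeof(inbuf), outleft = sizeof(outbuf), r;
        int err;

        if (cd == (iconv_t)-1)
            return 0;               /* cannot even open it */
        inbuf[0] = (char)0xC0;
        inbuf[1] = (char)0xAF;
        r = iconv(cd, &inp, &inleft, &outp, &outleft);
        err = errno;                /* save before iconv_close */
        iconv_close(cd);
        return r == (size_t)-1 && err == EILSEQ;
    }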

How can an iconv implementation be portable if it has to know all the
charsets used on all OSes? What worries me is that it must do
everything itself: if an OS provides an unusual charset, libiconv
will not see it.

OK, iconv is not as bad as I thought at first, but it's still far from
an elegant solution.

> Java's FileReader class (which implicitly converts char* to Unicode)
> takes an encoding argument. The list of permitted encodings is again
> platform and version dependent. Best is not to use this explicit
> encoding argument and rely on the locale dependent default value.

How do Java implementations find this locale-dependent default value?
Do they use e.g. iconv for the actual conversion, or do they somehow
determine only the name of the encoding and implement the conversion
themselves?

What about Perl and Python? AFAIK recent versions are beginning to
support Unicode. How do they implement the translation between the
unknown local charset and known Unicode?

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      PLACEHOLDER SIGNATURE
QRCZAK
