Re: Updated UTF-8 decoder stress test file

Marcin 'Qrczak' Kowalczyk Tue, 05 Sep 2000 04:18:31 -0700
Tue, 05 Sep 2000 09:27:44 +0100, Markus Kuhn <[EMAIL PROTECTED]> wrote:

> I have not yet formed an opinion on whether characters > 0x10FFFF
> should be rejected. Apparently ISO 10646-1:2000 deprecates these
> now because they don't fit into UTF-16, but I haven't received a
> copy of the new ISO standard yet and want to read the precise text
> first before I form an opinion on that one.

If ISO 10646 is really going to define a 20.087-bit space, then
perhaps Haskell should limit legal Char values to '\0'..'\x10FFFF',
not '\0'..'\x7FFFFFFF'.
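
For reference, the range '\0'..'\x10FFFF' holds 0x110000 = 17 * 65536
= 1114112 values, and log2(1114112) is about 20.087, which is where the
bit count above comes from. A minimal sketch of the corresponding range
check, in C; the names MAX_CODE_POINT and in_utf16_range are made up for
illustration and come from no library:

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest code point reachable through UTF-16 surrogate pairs. */
#define MAX_CODE_POINT 0x10FFFFu

/* Hypothetical helper: true if cp lies in 0..0x10FFFF.
   Whether surrogate code points should also be rejected is a
   separate question that this sketch does not address. */
bool in_utf16_range(uint32_t cp)
{
    return cp <= MAX_CODE_POINT;
}
```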

I hope that the history of 0x10FFFF will not end like the history of
real mode addressing on Intels :-)

Which endianness variants of UTF-16 make sense to provide? Only BE?
BE, LE and native? What about raw UCS-4? It's not only for external
files but also e.g. for interfacing to C functions taking wchar_t *.
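
To make the BE/LE question concrete, here is a rough sketch of a UTF-16
encoder parameterized by byte order. The names byte_order, put_unit and
utf16_encode are invented for this example; a "native" variant would
simply pick one of the two orders depending on the machine.

```c
#include <stddef.h>
#include <stdint.h>

enum byte_order { ORDER_BE, ORDER_LE };

/* Stores one 16-bit code unit in the requested byte order. */
void put_unit(uint8_t *out, uint16_t unit, enum byte_order order)
{
    if (order == ORDER_BE) {
        out[0] = (uint8_t)(unit >> 8);
        out[1] = (uint8_t)(unit & 0xFF);
    } else {
        out[0] = (uint8_t)(unit & 0xFF);
        out[1] = (uint8_t)(unit >> 8);
    }
}

/* Encodes cp as UTF-16 into out, which needs room for 4 bytes.
   Returns the number of bytes written (2 or 4), or 0 if cp is a
   surrogate code point or lies above 0x10FFFF. */
size_t utf16_encode(uint32_t cp, uint8_t *out, enum byte_order order)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        put_unit(out, (uint16_t)cp, order);
        return 2;
    }
    cp -= 0x10000;  /* 20 bits remain, split into two 10-bit halves */
    put_unit(out, (uint16_t)(0xD800 | (cp >> 10)), order);
    put_unit(out + 2, (uint16_t)(0xDC00 | (cp & 0x3FF)), order);
    return 4;
}
```

The only difference between the variants is the order in which the two
bytes of each code unit are stored; the surrogate arithmetic is shared.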

I am aware that wchar_t need not be Unicode, but I don't know what
else I can do other than provide functions usable only in cases where
wchar_t is Unicode. Well, I haven't seen any interesting C libraries
working on wchar_t yet...
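
One way to detect the "wchar_t is Unicode" case at compile time is the
C99 macro __STDC_ISO_10646__, which an implementation defines when
wchar_t values are ISO 10646 code points. A small sketch, where the
function name wchar_is_iso10646 is made up for illustration:

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t holds ISO 10646 code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A result of 0 does not prove that wchar_t is something else; it only
means that the implementation does not say.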

What is frustrating about glibc's implementation of iconv is that
it uses the same internal format as one of my internal formats for
interfacing to C implementations of conversions (an array of words in
native endianness) but does not provide that format externally. The
best match is the same format in big endian. To use this iconv, texts
would have to be endian-swapped twice on input and twice on output, and
both my framework and iconv's would use a separate "step" with its
own buffers for each swap.
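
Each such swapping step amounts to something like the sketch below,
assuming 32-bit words. The names swap32 and swap32_buffer are invented
here and are not taken from glibc or from my framework.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverses the byte order of one 32-bit word. */
uint32_t swap32(uint32_t w)
{
    return (w >> 24)
         | ((w >> 8) & 0x0000FF00u)
         | ((w << 8) & 0x00FF0000u)
         | (w << 24);
}

/* Swaps every word of a buffer in place. */
void swap32_buffer(uint32_t *words, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        words[i] = swap32(words[i]);
}
```

Two consecutive swaps give back the original data, which is why doing
them on both sides of the interface is pure overhead.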

I understand that using native endianness in an external format is
a bad idea, but that should not mean that a basic operation needs
unnecessary workarounds. libiconv uses the same encoding internally
and does provide it externally too. Why couldn't glibc's iconv do so?

I am beginning to hate standards that leave every detail
implementation-defined.

I am still going to use mbrtowc and wcrtomb to convert between Unicode
and the local multibyte encoding. If the Glasgow Haskell Compiler is
ported to a system where wchar_t is not Unicode, local conversions
will simply have to be implemented differently. mbrtowc and wcrtomb
seem to be the most direct way on Linux and should work on at least
some other systems too, without magic knowledge about charset names.
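
A minimal sketch of the mbrtowc direction, assuming the caller has
already selected the local encoding with setlocale(LC_CTYPE, "").
The function name mb_to_wide is hypothetical; mbrtowc, mbstate_t and
the return values (size_t)-1 and (size_t)-2 are standard C.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Converts the NUL-terminated multibyte string src, in the current
   locale's encoding, into at most cap wide characters in out.
   Returns the number of wide characters stored, or (size_t)-1 if the
   input contains an invalid or incomplete multibyte sequence. */
size_t mb_to_wide(const char *src, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t len = strlen(src);
    size_t n = 0;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    while (len > 0 && n < cap) {
        size_t r = mbrtowc(&out[n], src, len, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        if (r == 0)
            break;                  /* defensive: embedded NUL */
        src += r;
        len -= r;
        n++;
    }
    return n;
}
```

The values in out are Unicode code points only on systems where
wchar_t is Unicode, which is exactly the limitation described above.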

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/