Re: Updated UTF-8 decoder stress test file

Marcin 'Qrczak' Kowalczyk Tue, 05 Sep 2000 00:30:50 -0700

Sun, 03 Sep 2000 16:12:09 +0100, Markus Kuhn <[EMAIL PROTECTED]> writes:

>   UTF-8-test.txt

Ah, it helped me find a bug in my UTF-8 decoder for Haskell.

It also showed me that iconv in glibc-2.1.3 sucks (a "break" that exits the
wrong loop in the UTF-8 decoder, no attempt to detect many illegal sequences,
and a bad errno when the UCS-4 encoder is given an odd-sized output buffer).

libiconv is better, but it sometimes returns more U+FFFD characters than
the test file recommends.

> It now contains an additional section 5 with UTF-8 sequences for
> illegal code positions that a good decoder should reject (surrogates,
> U+FFFE, U+FFFF) like overlong and malformed sequences for security
> reasons, as well as all the relevant legal boundary conditions
> for these.

Should they be rejected by decoders of other formats when applicable,
e.g. U+FFFF in UTF-16 or surrogates in UCS-4?
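
To make the checks from the quoted section 5 concrete, here is a rough
sketch of a strict decoder in Python. It is only an illustration, not the
Haskell decoder mentioned above and not what iconv or libiconv do. The name
decode_strict is made up, and the sketch assumes the current 4-byte limit
(nothing above U+10FFFF), whereas 5- and 6-byte forms were still defined
in 2000. It raises an error instead of substituting U+FFFD.

```python
def decode_strict(data: bytes) -> list[int]:
    """Decode UTF-8 into a list of code points, rejecting suspicious input.

    Hypothetical helper for illustration only. Raises ValueError on
    malformed, overlong, surrogate, U+FFFE/U+FFFF and out-of-range input.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick sequence length, payload bits of the lead byte, and the
        # smallest code point that legitimately needs this many bytes.
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0x0
        elif 0xC0 <= b0 < 0xE0:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 < 0xF0:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 < 0xF8:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            # Stray continuation byte (0x80-0xBF) or lead byte 0xF8-0xFF.
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")

        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")

        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)

        if cp < minimum:
            raise ValueError(f"overlong encoding of U+{cp:04X}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X}")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"illegal code position U+{cp:04X}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point 0x{cp:X} out of range")

        out.append(cp)
        i += length
    return out
```

The final three range checks operate on the decoded code point rather than
on the bytes, so the same tests could in principle be reused after a UTF-16
or UCS-4 decoding step, which is what the question above is asking about.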

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
