On 05/02/07, Chris Kuklewicz <[EMAIL PROTECTED]> wrote:
shelarcy wrote:

> Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.

UTF-8 uses 1, 2, 3, or 4 bytes.  Anything that does not support 4 bytes does not
support UTF-8.

Well, some of them are probably a bit dated; they likely supported an
older version of the standard.


> But Takusen's implementation supports it correctly.

Takusen does have unreachable dead code to serialize a Char as (ord c :: Int),
up to 31 bits, into as many as 6 bytes.  But it does decode up to 6 bytes to 31
bits and then tries to "chr" that Int back into a Char.  Decoding that many bits
is not consistent with the UTF-8 standard.
UTF-8 is at most a 4-byte encoding.  There is no valid UTF-8 5- or 6-byte encoding.
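
(For reference, and not part of Chris's message: a minimal sketch in Haskell of
the byte counts he describes, taking an Int code point since that is what a
decoder like Takusen's produces.  The function name is mine, purely for
illustration.)

-- Number of bytes RFC 3629 UTF-8 uses for a given code point,
-- or Nothing if the value cannot be encoded at all.
utf8Length :: Int -> Maybe Int
utf8Length n
  | n < 0         = Nothing
  | n <= 0x7F     = Just 1   -- U+0000 .. U+007F
  | n <= 0x7FF    = Just 2   -- U+0080 .. U+07FF
  | n <= 0xFFFF   = Just 3   -- U+0800 .. U+FFFF
  | n <= 0x10FFFF = Just 4   -- U+10000 .. U+10FFFF
  | otherwise     = Nothing  -- no 5- or 6-byte forms in UTF-8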

Chris is right here: Takusen's decoder is incorrect w.r.t. the standard in that
it allows up to 6 bytes to encode a single char.  If it were correct, it would
reject 5- and 6-byte sequences.  I copied the extended conversion from HXT's
code, which was the most correct UTF-8 library I had seen so far (it just
didn't marshal directly from a CString, which was what I was after).
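
For what it's worth, here is a rough sketch (mine, not darcs' or HXT's actual
code) of the lead-byte check a strict decoder would do: anything in 0xF8..0xFF,
which would start the old 5- and 6-byte forms, is rejected up front.

import Data.Word (Word8)

-- Length of the sequence a UTF-8 lead byte introduces, per RFC 3629,
-- or Nothing for bytes that can never start a valid sequence.
seqLength :: Word8 -> Maybe Int
seqLength b
  | b < 0x80              = Just 1   -- ASCII
  | b >= 0xC2 && b < 0xE0 = Just 2   -- 0xC0/0xC1 would only give overlong forms
  | b >= 0xE0 && b < 0xF0 = Just 3
  | b >= 0xF0 && b < 0xF5 = Just 4   -- 0xF5..0xF7 would exceed U+10FFFF
  | otherwise             = Nothing  -- continuation bytes and 5/6-byte leads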

It turns out darcs has the most accurate UTF-8 encoders and decoders:
 http://abridgegame.org/cgi-bin/darcs.cgi/darcs/UTF8.lhs?c=annotate

There's nothing stopping the Unicode Consortium from expanding the
range of code points, is there? Or have they said that'll never happen?

Alistair
