Re: Unicode support

Kent Karlsson Tue, 09 Oct 2001 03:14:58 -0700


----- Original Message -----
From: "Ashley Yakeley" <[EMAIL PROTECTED]>
To: "Kent Karlsson" <[EMAIL PROTECTED]>; "Haskell List" <[EMAIL PROTECTED]>;
    "Libraries for Haskell List" <[EMAIL PROTECTED]>
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support

> At 2001-10-09 02:58, Kent Karlsson wrote:
>
> >In summary:
> >
> >    code position (=code point): a value between 0000 and 10FFFF.
>
> Would this be a reasonable basis for Haskell's 'Char' type?

Yes.  It's essentially UTF-32, but without being tied to a 32-bit
representation (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, which
has yet to be limited to 10FFFF instead of 31(!) bits) is the datatype
used in some implementations of C for wchar_t.  As I said in another
e-mail, if one does not have high efficiency concerns, UTF-32 is a
rather straightforward way of representing characters.
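
To make the "21 bits suffice" remark concrete, here is a minimal sketch.
The names maxCodePoint, bitsNeeded and toCodePoint are made up for this
illustration and are not part of any standard library; only chr comes
from Data.Char.

```haskell
import Data.Char (chr)

-- Largest code position, 10FFFF (hypothetical helper name).
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- Number of bits needed to write a non-negative integer in binary.
bitsNeeded :: Int -> Int
bitsNeeded 0 = 0
bitsNeeded n = 1 + bitsNeeded (n `div` 2)

-- Convert an integer to a Char only if it lies in 0000..10FFFF.
-- Note that this range still includes the surrogate code positions.
toCodePoint :: Int -> Maybe Char
toCodePoint n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

Here bitsNeeded maxCodePoint evaluates to 21, and toCodePoint 0x110000
gives Nothing, since that value lies outside the code space.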

> At some point
> perhaps there should be a 'Unicode' standard library for Haskell. For
> instance:
>
> encodeUTF8 :: String -> [Word8];
> decodeUTF8 :: [Word8] -> Maybe String;
> encodeUTF16 :: String -> [Word16];
> decodeUTF16 :: [Word16] -> Maybe String;
>
> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> getGeneralCategory :: Char -> Maybe GeneralCategory;

There is not really any need for a "Maybe" there.  Code positions that
are not yet allocated have general category Cn (as do non-characters):
      Cs Other, Surrogate
      Co Other, Private Use
      Cn Other, Not Assigned (yet)
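
As an illustration of a total (Maybe-free) category function: the
Data.Char module in present-day GHC's base library has this shape,
with generalCategory :: Char -> GeneralCategory and constructors
Surrogate, PrivateUse and NotAssigned for Cs, Co and Cn.  The helper
names below (isAssigned, otherAbbrev) are hypothetical, chosen only
for this sketch.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Every Char has a category, so "unassigned" is an ordinary result
-- (NotAssigned, i.e. Cn) rather than a missing one.
isAssigned :: Char -> Bool
isAssigned c = generalCategory c /= NotAssigned

-- Two-letter abbreviation for the three "Other" categories listed
-- above; any other category is returned via its Show instance.
otherAbbrev :: Char -> String
otherAbbrev c = case generalCategory c of
  Surrogate   -> "Cs"
  PrivateUse  -> "Co"
  NotAssigned -> "Cn"
  cat         -> show cat
```

For example, otherAbbrev '\xD800' is "Cs", otherAbbrev '\xE000' is
"Co", and otherAbbrev '\xFFFF' (a non-character) is "Cn".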

> ...sorting & searching...
>
> ...canonicalisation...
>
> etc. Lots of work for someone.

Yes.  And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).

        Kind regards
        /kent k

_______________________________________________
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell
