----- Original Message -----
From: "Ashley Yakeley" <[EMAIL PROTECTED]>
To: "Kent Karlsson" <[EMAIL PROTECTED]>; "Haskell List" <[EMAIL PROTECTED]>; "Libraries for Haskell List" <[EMAIL PROTECTED]>
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support

> At 2001-10-09 02:58, Kent Karlsson wrote:
>
> >In summary:
> >
> > code position (=code point): a value between 0000 and 10FFFF.
>
> Would this be a reasonable basis for Haskell's 'Char' type?

Yes. It's essentially UTF-32, but without the fixation on 32 bits
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, which has yet to be
limited to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t. As I said in another e-mail, if one
does not have high efficiency concerns, UTF-32 is a rather
straightforward way of representing characters.

> At some point
> perhaps there should be a 'Unicode' standard library for Haskell. For
> instance:
>
> encodeUTF8 :: String -> [Word8];
> decodeUTF8 :: [Word8] -> Maybe String;
> encodeUTF16 :: String -> [Word16];
> decodeUTF16 :: [Word16] -> Maybe String;
>
> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> getGeneralCategory :: Char -> Maybe GeneralCategory;

There is not really any need for the "Maybe" there. Code positions that
are not yet allocated have general category Cn (so do non-characters):

    Cs   Other, Surrogate
    Co   Other, Private Use
    Cn   Other, Not Assigned (yet)

> ...sorting & searching...
>
> ...canonicalisation...
>
> etc. Lots of work for someone.

Yes. And it is lots of work (which is why I'm not volunteering to make
a quick fix: there is no quick fix).

Kind regards
/kent k

_______________________________________________
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell
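
A minimal sketch of the encodeUTF8 signature proposed in the message above, assuming 'Char' holds a code point in the range 0000..10FFFF as discussed. The helper encodeChar and the predicate isNotAssigned are hypothetical names introduced only for illustration; they are not part of the proposal. The sketch also leans on generalCategory from Data.Char in today's GHC base library, which returns a GeneralCategory directly (no Maybe) and so happens to match the point made in the reply: unallocated positions and non-characters are reported as NotAssigned (Cn).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (GeneralCategory (..), generalCategory, ord)
import Data.Word (Word8)

-- Encode a single code point (0000..10FFFF) as 1 to 4 UTF-8 bytes.
-- Assumption: surrogates (D800..DFFF) are NOT rejected here; a
-- stricter encoder would refuse them.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c

    -- Leading payload bits, shifted down into the first byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)

    -- Continuation byte: marker 10xxxxxx plus six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- The signature suggested in the message, built from encodeChar.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeChar

-- Hypothetical helper: True for code positions whose general
-- category is Cn (unallocated positions and non-characters).
isNotAssigned :: Char -> Bool
isNotAssigned c = generalCategory c == NotAssigned
```

For instance, encodeUTF8 "\x20AC" gives the three bytes E2 82 AC, and generalCategory '\xD800' is Surrogate (Cs) rather than a missing value. A decodeUTF8 returning Maybe String would additionally have to reject overlong and truncated sequences, which this sketch does not attempt.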