On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
> On Tue, Feb 06, 2007 at 03:16:17PM +0900, shelarcy wrote:
> > I'm afraid that its fantasy is broken again, as no surrogate
> > pair UCS-2 cover all language that is trusted before Europe
> > and America people.
>
> UCS-2 is a disaster in every way. someone had to say it. :)
>
> everything should be ascii, utf8 or ucs-4 or migrating to it.

Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
good internal format. It is more compact than UTF-32 in almost all cases
and a less complex encoding than UTF-8. So it's faster than either
UTF-32 (because of data density) or UTF-8 (because of the encoding
complexity). The downside compared to UTF-32 is that it is a more
complex encoding, so the code is harder to write (but apparently this
doesn't affect performance much, because characters outside the BMP are
very rare).

The ICU lib uses UTF-16 internally, I believe, though at the moment I
can't find the bit on their website where they explain why they use
UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/

Btw, when it comes to all these encoding names, I find it helpful to
maintain the fiction that there's no such thing (any more) as UCS-N;
there's only UTF-8, 16 and 32. This is also what the Unicode consortium
tries to encourage.

My view is that we should just provide all three:

  Data.PackedString.UTF8
  Data.PackedString.UTF16
  Data.PackedString.UTF32

all providing the same interface. This wouldn't actually be too much
code to write, since most of it can re-use the streams code. The only
difference is the single per-encoding implementation of:

  stream   :: PackedString -> Stream Char
  unstream :: Stream Char -> PackedString

and then we get fusion for free, of course.

I have proposed this task as an MSc project in my department. Hopefully
we'll get a student to pick this up.

Duncan
_______________________________________________
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell
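
As a rough illustration of the per-encoding stream/unstream pair described in the message, here is a minimal sketch for the UTF-16 case. Everything in it is an assumption made for illustration: PackedUtf16 is a hypothetical stand-in that wraps a plain list of 16-bit code units (a real packed string would use an unboxed array), and Step, Stream, streamList, unstreamList, pack and unpack are simplified names loosely modelled on the stream-fusion style, not the actual streams code. Only stream and unstream know about surrogate pairs; the rest is encoding-independent.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Char (chr, ord)
import Data.Word (Word16)

-- Hypothetical stand-in for a packed UTF-16 string: a list of code units.
newtype PackedUtf16 = PackedUtf16 [Word16] deriving (Eq, Show)

-- A stream is a step function plus a seed state (simplified, assumed shape).
data Step s a = Done | Skip s | Yield a s
data Stream a = forall s. Stream (s -> Step s a) s

isHigh, isLow :: Word16 -> Bool
isHigh w = w >= 0xD800 && w <= 0xDBFF
isLow  w = w >= 0xDC00 && w <= 0xDFFF

-- Decode UTF-16 code units into characters, one step at a time.
stream :: PackedUtf16 -> Stream Char
stream (PackedUtf16 ws) = Stream next ws
  where
    next [] = Done
    next (hi : lo : rest)
      | isHigh hi && isLow lo =
          Yield (chr (0x10000
                      + ((fromIntegral hi - 0xD800) `shiftL` 10)
                      + (fromIntegral lo - 0xDC00))) rest
    next (w : rest)
      | isHigh w || isLow w = Yield '\xFFFD' rest  -- unpaired surrogate
      | otherwise           = Yield (chr (fromIntegral w)) rest

-- Encode a stream of characters back into UTF-16 code units.
unstream :: Stream Char -> PackedUtf16
unstream (Stream next s0) = PackedUtf16 (go s0)
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield c s' -> encode c ++ go s'

-- One code unit inside the BMP, a surrogate pair outside it.
encode :: Char -> [Word16]
encode c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- Encoding-independent helpers that a shared interface could build on.
streamList :: [a] -> Stream a
streamList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

unstreamList :: Stream a -> [a]
unstreamList (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield a s' -> a : go s'

pack :: String -> PackedUtf16
pack = unstream . streamList

unpack :: PackedUtf16 -> String
unpack = unstreamList . stream
```

Under these assumptions, pack "\x1D11E" gives PackedUtf16 [0xD834,0xDD1E], and unpack . pack is the identity on strings without lone surrogates. A UTF-8 or UTF-32 variant would swap out only stream, unstream and encode.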