> On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
>> UCS-2 is a disaster in every way. Someone had to say it. :)

UCS-2 has been deprecated for many years.


>> Everything should be ASCII, UTF-8, or UCS-4, or migrating to it.

UCS-4 has also been deprecated for many years. The main forms of Unicode in use are UTF-16, UTF-8, and (less frequently) UTF-32.

On Feb 9, 2007, at 6:02 AM, Duncan Coutts wrote:
> Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
> good internal format. It is more compact than UTF-32 in almost all cases
> and a less complex encoding than UTF-8. So it's faster than either
> UTF-32 (because of data density) or UTF-8 (because of the encoding
> complexity). The downside compared to UTF-32 is that it is a more
> complex encoding, so the code is harder to write (but apparently that
> doesn't affect performance much, because characters outside the BMP are
> very rare).

UTF-16 is never less compact than UTF-32; at worst it is exactly the same size. That happens only when a string consists entirely of characters from the supplementary planes, which cost four bytes in both encodings.
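To make the size claim concrete, here is a minimal sketch (the function names are mine, purely for illustration) that counts the bytes a String would occupy in each encoding: BMP characters cost two bytes in UTF-16, supplementary characters four, and every character costs four in UTF-32.

    import Data.Char (ord)

    -- Bytes needed to store a String as UTF-16: BMP characters take one
    -- 16-bit code unit, supplementary characters take a surrogate pair.
    utf16Bytes :: String -> Int
    utf16Bytes = sum . map (\c -> if ord c > 0xFFFF then 4 else 2)

    -- Bytes needed as UTF-32: always one 32-bit code unit per character.
    utf32Bytes :: String -> Int
    utf32Bytes = (* 4) . length

For every s, utf16Bytes s <= utf32Bytes s, with equality exactly when every character of s is outside the BMP.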

> The ICU lib uses UTF-16 internally, I believe, though I can't at the
> moment find on their website the bit where they explain why they use
> UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/userguide/unicodeBasics.html

UTF-16 is the native Unicode encoding for ICU, Microsoft Windows, and Mac OS X.

> Btw, when it comes to all these encoding names, I find it helpful to
> maintain the fiction that there's no such thing (any more) as UCS-N;
> there's only UTF-8, -16, and -32. This is also what the Unicode
> Consortium tries to encourage.

It's not a fiction. :-) UCS-2 and UCS-4 are *deprecated*, by the merger between Unicode and ISO 10646 that limited the code point space to [0..0x10FFFF]. In addition to UTF-8, UTF-16, and UTF-32, there's SCSU, a compressed form used in some applications. See:

http://www.unicode.org/reports/tr17/

> My view is that we should just provide all three:
>
>   Data.PackedString.UTF8
>   Data.PackedString.UTF16
>   Data.PackedString.UTF32
>
> all providing the same interface. This wouldn't actually be too much
> code to write, since most of it can reuse the streams code; the only
> per-encoding difference is the implementation of:
>
>   stream   :: PackedString -> Stream Char
>   unstream :: Stream Char -> PackedString
>
> and then we get fusion for free, of course.
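To give a feel for how little per-encoding code that is, here is a rough sketch of stream and unstream for the UTF-16 case, using the Step/Stream types from the stream-fusion work. The PackedString representation below (an unboxed array of Word16 code units plus a length) is just an assumption for illustration, as is the list-based unstream; a real implementation would fill a buffer directly.

    {-# LANGUAGE ExistentialQuantification #-}
    import Data.Char (chr, ord)
    import Data.Word (Word16)
    import Data.Array.Unboxed (UArray, (!), listArray)

    -- Assumed representation: UTF-16 code units plus the unit count.
    data PackedString = PS (UArray Int Word16) Int

    data Step s a = Done | Skip s | Yield a s
    data Stream a = forall s. Stream (s -> Step s a) s

    -- stream: walk the code units, pairing surrogates up into Chars.
    stream :: PackedString -> Stream Char
    stream (PS arr len) = Stream next 0
      where
        next i
          | i >= len                  = Done
          | u >= 0xD800 && u < 0xDC00 =
              -- high surrogate: combine with the following low surrogate
              let v = fromIntegral (arr ! (i + 1))
                  c = 0x10000 + (u - 0xD800) * 0x400 + (v - 0xDC00)
              in Yield (chr c) (i + 2)
          | otherwise                 = Yield (chr u) (i + 1)
          where u = fromIntegral (arr ! i)

    -- unstream: encode Chars back into code units (via a list, for brevity).
    unstream :: Stream Char -> PackedString
    unstream (Stream next s0) = PS (listArray (0, n - 1) units) n
      where
        units = go s0
        n     = length units
        go s = case next s of
          Done     -> []
          Skip s'  -> go s'
          Yield c s'
            | ord c < 0x10000 -> fromIntegral (ord c) : go s'
            | otherwise       ->
                let x = ord c - 0x10000
                in fromIntegral (0xD800 + x `div` 0x400)
                 : fromIntegral (0xDC00 + x `mod` 0x400)
                 : go s'

Every Char-level operation is then written once as a Stream transformer (say, a hypothetical mapS), and compositions like unstream . mapS f . stream can fuse away the intermediate structures regardless of which encoding sits underneath.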

I agree that all three should be supported. UTF-16 is used on Windows and Mac OS X, and UTF-8 is widely used on Unix platforms (and at the BSD level of Mac OS X). UTF-32 matches Haskell's Char type and is used for wchar_t on some platforms. SCSU can be handled the same way as a non-Unicode encoding (e.g., GB2312 or Shift JIS).

Note that some Unicode algorithms require the ability to back up in a stream of code points, so that may be a consideration in the design (perhaps they could be implemented in Haskell in a way that doesn't require it; I'm still learning, so I'm not sure yet). Regular expression processing goes further and requires essentially random access (the same considerations apply there).
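To illustrate the backing-up requirement: in UTF-16 a single unit of look-behind is enough, because the trailing unit of a surrogate pair is always a low surrogate. A minimal sketch, reusing the Word16-array representation assumed above (the function name is made up):

    import Data.Word (Word16)
    import Data.Array.Unboxed (UArray, (!))

    -- Given the index just past a code point, return the index where that
    -- code point starts. A low surrogate can only be the second unit of a
    -- pair, so seeing one means the character began two units earlier.
    prevCharIndex :: UArray Int Word16 -> Int -> Int
    prevCharIndex arr i
      | u >= 0xDC00 && u <= 0xDFFF = i - 2
      | otherwise                  = i - 1
      where u = arr ! (i - 1)

UTF-8 needs up to three units of look-behind for the same operation, which is part of the encoding-complexity trade-off mentioned above.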

> I have proposed this task as an MSc project in my department. Hopefully
> we'll get a student to pick this up.

I hope so!

ICU has a BSD/MIT-style license, so feel free to steal whatever is appropriate from there. I would love to see Haskell support Unicode operations like locale-sensitive collation, text boundary analysis, and more on both [Char] and packed strings.
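For anyone who picks this up, here is a rough sketch of what a thin FFI binding to ICU's collation API might look like. Everything below is an assumption for illustration: collate and toUTF16 are made-up names, error handling is omitted, and real ICU builds often version-suffix the C symbols, which an actual binding would need to deal with. The C entry points themselves (ucol_open, ucol_strcoll, ucol_close) are ICU's.

    {-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
    import Data.Char (ord)
    import Data.Word (Word16)
    import Foreign (Ptr, with, withArrayLen)
    import Foreign.C (CInt, CString, withCString)

    data UCollator          -- opaque ICU collator object
    type UErrorCode = CInt  -- ICU error code; zero means success

    foreign import ccall unsafe "unicode/ucol.h ucol_open"
      ucol_open :: CString -> Ptr UErrorCode -> IO (Ptr UCollator)

    foreign import ccall unsafe "unicode/ucol.h ucol_strcoll"
      ucol_strcoll :: Ptr UCollator -> Ptr Word16 -> CInt
                   -> Ptr Word16 -> CInt -> IO CInt

    foreign import ccall unsafe "unicode/ucol.h ucol_close"
      ucol_close :: Ptr UCollator -> IO ()

    -- UTF-16 encoding, same scheme as in the earlier sketch.
    toUTF16 :: String -> [Word16]
    toUTF16 = concatMap enc
      where
        enc c
          | ord c < 0x10000 = [fromIntegral (ord c)]
          | otherwise       =
              let x = ord c - 0x10000
              in [ fromIntegral (0xD800 + x `div` 0x400)
                 , fromIntegral (0xDC00 + x `mod` 0x400) ]

    -- Hypothetical wrapper: compare two Strings under a locale's rules.
    collate :: String -> String -> String -> IO Ordering
    collate locale s t =
      withCString locale $ \cloc ->
      with 0 $ \perr -> do
        coll <- ucol_open cloc perr
        r <- withArrayLen (toUTF16 s) $ \slen sp ->
             withArrayLen (toUTF16 t) $ \tlen tp ->
               ucol_strcoll coll sp (fromIntegral slen)
                                 tp (fromIntegral tlen)
        ucol_close coll
        return (compare r 0)   -- ICU returns -1, 0, or 1

Something like collate "sv" "ändern" "zebra" would then give Swedish ordering, where ä sorts after z.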

Deborah
