> On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
>> UCS-2 is a disaster in every way. Someone had to say it. :)

UCS-2 has been deprecated for many years.


>> Everything should be ASCII, UTF-8, or UCS-4, or migrating to it.

UCS-4 has also been deprecated for many years. The main forms of Unicode in use are UTF-16, UTF-8, and (less frequently) UTF-32.

On Feb 9, 2007, at 6:02 AM, Duncan Coutts wrote:
> Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
> good internal format. It is more compact than UTF-32 in almost all cases
> and a less complex encoding than UTF-8. So it's faster than either
> UTF-32 (because of data density) or UTF-8 (because of the encoding
> complexity). The downside compared to UTF-32 is that it is a more
> complex encoding, so the code is harder to write (but apparently that
> doesn't affect performance much, because characters outside the BMP are
> very rare).

UTF-16 is never less compact than UTF-32; at worst it is exactly the same size. That happens only when a string consists entirely of characters from the supplementary planes, which cost four bytes in both encodings.
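To make the size claim concrete, here is a minimal sketch (the function names are mine, purely for illustration) that counts the bytes a String would occupy in each encoding: BMP characters cost two bytes in UTF-16, supplementary characters four, and every character costs four in UTF-32.

    import Data.Char (ord)

    -- Bytes needed to store a String as UTF-16: BMP characters take one
    -- 16-bit code unit, supplementary characters take a surrogate pair.
    utf16Bytes :: String -> Int
    utf16Bytes = sum . map (\c -> if ord c > 0xFFFF then 4 else 2)

    -- Bytes needed as UTF-32: always one 32-bit code unit per character.
    utf32Bytes :: String -> Int
    utf32Bytes = (* 4) . length

For every s, utf16Bytes s <= utf32Bytes s, with equality exactly when every character of s is outside the BMP.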

> The ICU lib uses UTF-16 internally, I believe, though I can't at the
> moment find on their website the bit where they explain why they use
> UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/userguide/unicodeBasics.html

UTF-16 is the native Unicode encoding for ICU, Microsoft Windows, and Mac OS X.

> Btw, when it comes to all these encoding names, I find it helpful to
> maintain the fiction that there's no such thing (any more) as UCS-N;
> there's only UTF-8, -16, and -32. This is also what the Unicode
> Consortium tries to encourage.

It's not a fiction. :-) UCS-2 and UCS-4 are *deprecated*, by the merger between Unicode and ISO 10646 that limited the code point space to [0..0x10FFFF]. In addition to UTF-8, UTF-16, and UTF-32, there's SCSU, a compressed form used in some applications. See:

http://www.unicode.org/reports/tr17/

> My view is that we should just provide all three:
>
>   Data.PackedString.UTF8
>   Data.PackedString.UTF16
>   Data.PackedString.UTF32
>
> all providing the same interface. This wouldn't actually be too much
> code to write, since most of it can reuse the streams code; the only
> per-encoding difference is the implementation of:
>
>   stream   :: PackedString -> Stream Char
>   unstream :: Stream Char -> PackedString
>
> and then we get fusion for free, of course.
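To give a feel for how little per-encoding code that is, here is a rough sketch of stream and unstream for the UTF-16 case, using the Step/Stream types from the stream-fusion work. The PackedString representation below (an unboxed array of Word16 code units plus a length) is just an assumption for illustration, as is the list-based unstream; a real implementation would fill a buffer directly.

    {-# LANGUAGE ExistentialQuantification #-}
    import Data.Char (chr, ord)
    import Data.Word (Word16)
    import Data.Array.Unboxed (UArray, (!), listArray)

    -- Assumed representation: UTF-16 code units plus the unit count.
    data PackedString = PS (UArray Int Word16) Int

    data Step s a = Done | Skip s | Yield a s
    data Stream a = forall s. Stream (s -> Step s a) s

    -- stream: walk the code units, pairing surrogates up into Chars.
    stream :: PackedString -> Stream Char
    stream (PS arr len) = Stream next 0
      where
        next i
          | i >= len                  = Done
          | u >= 0xD800 && u < 0xDC00 =
              -- high surrogate: combine with the following low surrogate
              let v = fromIntegral (arr ! (i + 1))
                  c = 0x10000 + (u - 0xD800) * 0x400 + (v - 0xDC00)
              in Yield (chr c) (i + 2)
          | otherwise                 = Yield (chr u) (i + 1)
          where u = fromIntegral (arr ! i)

    -- unstream: encode Chars back into code units (via a list, for brevity).
    unstream :: Stream Char -> PackedString
    unstream (Stream next s0) = PS (listArray (0, n - 1) units) n
      where
        units = go s0
        n     = length units
        go s = case next s of
          Done     -> []
          Skip s'  -> go s'
          Yield c s'
            | ord c < 0x10000 -> fromIntegral (ord c) : go s'
            | otherwise       ->
                let x = ord c - 0x10000
                in fromIntegral (0xD800 + x `div` 0x400)
                 : fromIntegral (0xDC00 + x `mod` 0x400)
                 : go s'

Every Char-level operation is then written once as a Stream transformer (say, a hypothetical mapS), and compositions like unstream . mapS f . stream can fuse away the intermediate structures regardless of which encoding sits underneath.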

I agree that all three should be supported. UTF-16 is used on Windows and Mac OS X, and UTF-8 is widely used on Unix platforms (and at the BSD level of Mac OS X). UTF-32 matches Haskell's Char type and is used for wchar_t on some platforms. SCSU can be handled the same way as a non-Unicode encoding (e.g., GB2312 or Shift JIS).

Note that some Unicode algorithms require the ability to back up in a stream of code points, so that may be a consideration in the design (perhaps they could be implemented in Haskell in a way that doesn't require it; I'm still learning, so I'm not sure yet). Regular expression processing goes further and requires essentially random access (the same considerations apply there).
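To illustrate the backing-up requirement: in UTF-16 a single unit of look-behind is enough, because the trailing unit of a surrogate pair is always a low surrogate. A minimal sketch, reusing the Word16-array representation assumed above (the function name is made up):

    import Data.Word (Word16)
    import Data.Array.Unboxed (UArray, (!))

    -- Given the index just past a code point, return the index where that
    -- code point starts. A low surrogate can only be the second unit of a
    -- pair, so seeing one means the character began two units earlier.
    prevCharIndex :: UArray Int Word16 -> Int -> Int
    prevCharIndex arr i
      | u >= 0xDC00 && u <= 0xDFFF = i - 2
      | otherwise                  = i - 1
      where u = arr ! (i - 1)

UTF-8 needs up to three units of look-behind for the same operation, which is part of the encoding-complexity trade-off mentioned above.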

> I have proposed this task as an MSc project in my department. Hopefully
> we'll get a student to pick this up.

I hope so!

ICU has a BSD/MIT-style license, so feel free to steal whatever is appropriate from there. I would love to see Haskell support Unicode operations like locale-sensitive collation, text boundary analysis, and more on both [Char] and packed strings.
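For anyone who picks this up, here is a rough sketch of what a thin FFI binding to ICU's collation API might look like. Everything below is an assumption for illustration: collate and toUTF16 are made-up names, error handling is omitted, and real ICU builds often version-suffix the C symbols, which an actual binding would need to deal with. The C entry points themselves (ucol_open, ucol_strcoll, ucol_close) are ICU's.

    {-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
    import Data.Char (ord)
    import Data.Word (Word16)
    import Foreign (Ptr, with, withArrayLen)
    import Foreign.C (CInt, CString, withCString)

    data UCollator          -- opaque ICU collator object
    type UErrorCode = CInt  -- ICU error code; zero means success

    foreign import ccall unsafe "unicode/ucol.h ucol_open"
      ucol_open :: CString -> Ptr UErrorCode -> IO (Ptr UCollator)

    foreign import ccall unsafe "unicode/ucol.h ucol_strcoll"
      ucol_strcoll :: Ptr UCollator -> Ptr Word16 -> CInt
                   -> Ptr Word16 -> CInt -> IO CInt

    foreign import ccall unsafe "unicode/ucol.h ucol_close"
      ucol_close :: Ptr UCollator -> IO ()

    -- UTF-16 encoding, same scheme as in the earlier sketch.
    toUTF16 :: String -> [Word16]
    toUTF16 = concatMap enc
      where
        enc c
          | ord c < 0x10000 = [fromIntegral (ord c)]
          | otherwise       =
              let x = ord c - 0x10000
              in [ fromIntegral (0xD800 + x `div` 0x400)
                 , fromIntegral (0xDC00 + x `mod` 0x400) ]

    -- Hypothetical wrapper: compare two Strings under a locale's rules.
    collate :: String -> String -> String -> IO Ordering
    collate locale s t =
      withCString locale $ \cloc ->
      with 0 $ \perr -> do
        coll <- ucol_open cloc perr
        r <- withArrayLen (toUTF16 s) $ \slen sp ->
             withArrayLen (toUTF16 t) $ \tlen tp ->
               ucol_strcoll coll sp (fromIntegral slen)
                                 tp (fromIntegral tlen)
        ucol_close coll
        return (compare r 0)   -- ICU returns -1, 0, or 1

Something like collate "sv" "ändern" "zebra" would then give Swedish ordering, where ä sorts after z.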

Deborah
