Sorry for the long delay, work has been really busy...

On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:
> On 2007-09-27, Aaron Denney <[EMAIL PROTECTED]> wrote:
>> Well, not so much. As Duncan mentioned, it's a matter of what the
>> most common case is. UTF-16 is effectively fixed-width for the
>> majority of text in the majority of languages. Combining sequences
>> and surrogate pairs are relatively infrequent.

> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead.  All such seeking must be linear for both
> UTF-16 *and* UTF-8.
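
For concreteness, here is a minimal sketch of that linear walk in Haskell, assuming the text is held as a bare list of UTF-16 code units (not any particular library's type). Each step consumes one unit, or two when it hits a high surrogate, so skipping n characters is necessarily O(n):

import Data.Word (Word16)

-- Drop the first n code points from a sequence of UTF-16 code units.
-- A high surrogate (0xD800-0xDBFF) leads a two-unit pair, so one
-- "character" step may consume one unit or two; there is no fixed
-- byte offset you can jump to for the nth character.
dropCodePoints :: Int -> [Word16] -> [Word16]
dropCodePoints 0 us = us
dropCodePoints _ [] = []
dropCodePoints n (u:us)
  | u >= 0xD800 && u <= 0xDBFF = dropCodePoints (n - 1) (drop 1 us)
  | otherwise                  = dropCodePoints (n - 1) us

The same shape holds for UTF-8, except that a lead byte can start a 1-, 2-, 3-, or 4-byte sequence.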

>> Speaking as someone who has done a lot of Unicode implementation, I
>> would say UTF-16 represents the best time/space tradeoff for an
>> internal representation. As I mentioned, it's what's used in Windows,
>> Mac OS X, ICU, and Java.

> I guess why I'm being something of a pain-in-the-ass here is that
> I want to use your Unicode implementation expertise to know what
> these time/space tradeoffs are.
>
> Are there any asymptotic algorithmic complexity differences, or are
> these all constant factors?  The constant factors depend on the
> projected workload.  And are these actually tradeoffs, except between
> UTF-32 (which uses native word sizes on 32-bit platforms) and the
> other two?  Smaller space means a smaller cache footprint, which can
> dominate.
Yes, cache footprint is one reason to use UTF-16 rather than UTF-32. The absence of surrogate pairs in UTF-32 doesn't buy you anything either, because you still have to handle multi-code-point sequences such as combining marks and grapheme clusters.
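
To illustrate that second point, here is a rough sketch in Haskell. It only groups combining marks with the preceding base character; real grapheme cluster segmentation (UAX #29) has more rules, but it is enough to show that even with fixed-width code points, "one character" is variable length:

import Data.Char (generalCategory, GeneralCategory(..))

-- Group a string into rough clusters: a base character plus any
-- combining marks that follow it.
clusters :: String -> [String]
clusters []     = []
clusters (c:cs) = (c : marks) : clusters rest
  where
    (marks, rest) = span isCombining cs
    isCombining x = generalCategory x `elem`
                      [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

For example, "e" followed by U+0301 COMBINING ACUTE ACCENT comes back as one cluster, not two.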

The best reference for all of this is:

http://www.unicode.org/faq/utf_bom.html

See especially:
http://www.unicode.org/faq/utf_bom.html#10
http://www.unicode.org/faq/utf_bom.html#12

Which data type is best depends on what the purpose is. If the data will be primarily ASCII with occasional non-ASCII characters, UTF-8 may be best. If the data is general Unicode text, UTF-16 is best. I would think a Unicode string type would be intended for processing natural-language text, not just ASCII data.
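
To put rough numbers on that space tradeoff, here is a small back-of-the-envelope sketch in plain Haskell (no particular string library; the sample strings are only illustrations):

import Data.Char (ord)

-- Bytes per code point under each encoding (supplementary-plane
-- characters, i.e. code points above U+FFFF, take 4 bytes in both).
utf8Len, utf16Len :: Char -> Int
utf8Len c
  | ord c < 0x80    = 1
  | ord c < 0x800   = 2
  | ord c < 0x10000 = 3
  | otherwise       = 4
utf16Len c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- Encoded size of a string in (UTF-8 bytes, UTF-16 bytes).
sizes :: String -> (Int, Int)
sizes s = (sum (map utf8Len s), sum (map utf16Len s))

main :: IO ()
main = do
  print (sizes "hello, world")      -- (12,24): mostly-ASCII favors UTF-8
  print (sizes "日本語のテキスト")  -- (24,16): BMP-heavy text favors UTF-16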

Simplicity of algorithms is also a concern. Validating a byte sequence as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.
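
For comparison, a sketch of the UTF-16 side in Haskell (again assuming a bare list of 16-bit code units): the whole well-formedness check fits in a few lines, whereas a UTF-8 validator also has to reject overlong encodings, stray continuation bytes, encoded surrogates, and values above U+10FFFF.

import Data.Word (Word16)

-- Well-formedness check for UTF-16: the only failure mode is a
-- mismatched surrogate (a high surrogate, 0xD800-0xDBFF, not followed
-- by a low surrogate, 0xDC00-0xDFFF, or a low surrogate on its own).
validUtf16 :: [Word16] -> Bool
validUtf16 []     = True
validUtf16 (u:us)
  | isHigh u  = case us of
                  v:vs | isLow v -> validUtf16 vs
                  _              -> False
  | isLow u   = False
  | otherwise = validUtf16 us
  where
    isHigh w = w >= 0xD800 && w <= 0xDBFF
    isLow  w = w >= 0xDC00 && w <= 0xDFFF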

> (I'd also like to see a reference to the Mac OS X encoding. I know that
> the filesystem interface is UTF-8 (decomposed a certain way).  Is it
> just that UTF-16 is a common application choice, or is there some
> common framework or library that uses it?)

UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and is what appears in the APIs for all of them. UTF-16 is also what's stored in the volume catalog on Mac disks. UTF-8 is only used in BSD APIs for backward compatibility. It's also used in plain text files (or XML or HTML), again for compatibility.

Deborah

