Sorry for the long delay, work has been really busy...

On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:
> On 2007-09-27, Aaron Denney <[EMAIL PROTECTED]> wrote:
>> Well, not so much. As Duncan mentioned, it's a matter of what the
>> most common case is. UTF-16 is effectively fixed-width for the
>> majority of text in the majority of languages. Combining sequences
>> and surrogate pairs are relatively infrequent.

> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead.  All such seeking must be linear for both
> UTF-16 *and* UTF-8.
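
For concreteness, here is a minimal sketch of that linear walk in Haskell, assuming the text is held as a bare list of UTF-16 code units (not any particular library's type). Each step consumes one unit, or two when it hits a high surrogate, so skipping n characters is necessarily O(n):

import Data.Word (Word16)

-- Drop the first n code points from a sequence of UTF-16 code units.
-- A high surrogate (0xD800-0xDBFF) leads a two-unit pair, so one
-- "character" step may consume one unit or two; there is no fixed
-- byte offset you can jump to for the nth character.
dropCodePoints :: Int -> [Word16] -> [Word16]
dropCodePoints 0 us = us
dropCodePoints _ [] = []
dropCodePoints n (u:us)
  | u >= 0xD800 && u <= 0xDBFF = dropCodePoints (n - 1) (drop 1 us)
  | otherwise                  = dropCodePoints (n - 1) us

The same shape holds for UTF-8, except that a lead byte can start a 1-, 2-, 3-, or 4-byte sequence.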

>> Speaking as someone who has done a lot of Unicode implementation, I
>> would say UTF-16 represents the best time/space tradeoff for an
>> internal representation. As I mentioned, it's what's used in Windows,
>> Mac OS X, ICU, and Java.

> I guess why I'm being something of a pain-in-the-ass here is that
> I want to use your Unicode implementation expertise to know what
> these time/space tradeoffs are.
>
> Are there any asymptotic algorithmic complexity differences, or are
> these all constant factors?  The constant factors depend on the
> projected workload.  And are these actually tradeoffs, except between
> UTF-32 (which uses native word sizes on 32-bit platforms) and the
> other two?  Smaller space means a smaller cache footprint, which can
> dominate.
Yes, cache footprint is one reason to use UTF-16 rather than UTF-32. The absence of surrogate pairs in UTF-32 doesn't buy you anything either, because you still have to handle multi-code-point sequences such as combining marks and grapheme clusters.
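
To illustrate that second point, here is a rough sketch in Haskell. It only groups combining marks with the preceding base character; real grapheme cluster segmentation (UAX #29) has more rules, but it is enough to show that even with fixed-width code points, "one character" is variable length:

import Data.Char (generalCategory, GeneralCategory(..))

-- Group a string into rough clusters: a base character plus any
-- combining marks that follow it.
clusters :: String -> [String]
clusters []     = []
clusters (c:cs) = (c : marks) : clusters rest
  where
    (marks, rest) = span isCombining cs
    isCombining x = generalCategory x `elem`
                      [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

For example, "e" followed by U+0301 COMBINING ACUTE ACCENT comes back as one cluster, not two.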

The best reference for all of this is:

http://www.unicode.org/faq/utf_bom.html

See especially:
http://www.unicode.org/faq/utf_bom.html#10
http://www.unicode.org/faq/utf_bom.html#12

Which data type is best depends on what the purpose is. If the data will be primarily ASCII with occasional non-ASCII characters, UTF-8 may be best. If the data is general Unicode text, UTF-16 is best. I would think a Unicode string type would be intended for processing natural-language text, not just ASCII data.
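
To put rough numbers on that space tradeoff, here is a small back-of-the-envelope sketch in plain Haskell (no particular string library; the sample strings are only illustrations):

import Data.Char (ord)

-- Bytes per code point under each encoding (supplementary-plane
-- characters, i.e. code points above U+FFFF, take 4 bytes in both).
utf8Len, utf16Len :: Char -> Int
utf8Len c
  | ord c < 0x80    = 1
  | ord c < 0x800   = 2
  | ord c < 0x10000 = 3
  | otherwise       = 4
utf16Len c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- Encoded size of a string in (UTF-8 bytes, UTF-16 bytes).
sizes :: String -> (Int, Int)
sizes s = (sum (map utf8Len s), sum (map utf16Len s))

main :: IO ()
main = do
  print (sizes "hello, world")      -- (12,24): mostly-ASCII favors UTF-8
  print (sizes "日本語のテキスト")  -- (24,16): BMP-heavy text favors UTF-16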

Simplicity of algorithms is also a concern. Validating a byte sequence as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.
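
For comparison, a sketch of the UTF-16 side in Haskell (again assuming a bare list of 16-bit code units): the whole well-formedness check fits in a few lines, whereas a UTF-8 validator also has to reject overlong encodings, stray continuation bytes, encoded surrogates, and values above U+10FFFF.

import Data.Word (Word16)

-- Well-formedness check for UTF-16: the only failure mode is a
-- mismatched surrogate (a high surrogate, 0xD800-0xDBFF, not followed
-- by a low surrogate, 0xDC00-0xDFFF, or a low surrogate on its own).
validUtf16 :: [Word16] -> Bool
validUtf16 []     = True
validUtf16 (u:us)
  | isHigh u  = case us of
                  v:vs | isLow v -> validUtf16 vs
                  _              -> False
  | isLow u   = False
  | otherwise = validUtf16 us
  where
    isHigh w = w >= 0xD800 && w <= 0xDBFF
    isLow  w = w >= 0xDC00 && w <= 0xDFFF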

> (I'd also like to see a reference to the Mac OS X encoding. I know that
> the filesystem interface is UTF-8 (decomposed a certain way).  Is it
> just that UTF-16 is a common application choice, or is there some
> common framework or library that uses it?)

UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and is what appears in the APIs for all of them. UTF-16 is also what's stored in the volume catalog on Mac disks. UTF-8 is only used in BSD APIs for backward compatibility. It's also used in plain text files (or XML or HTML), again for compatibility.

Deborah

