At 09:56 PM 12/2/2004, Doug Ewell wrote:
I use ... and UTF-32 for most internal processing that I write
myself.  Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text in memory at once, so the overhead is much
less important than one code unit per character.


For performance-critical applications, on the other hand, you need to use
whichever UTF gives you the right balance of speed and average storage
size for your data.

If you have very large amounts of data, you'll be sensitive to cache
overruns, enough so that UTF-32 may be disqualified from the start.
I have encountered systems for which that was true.

If your 'per character' operations are based on parsing for ASCII symbols,
e.g. HTML parsing, then both UTF-8 and UTF-16 allow you to process your
data directly, without needing to worry about the longer sequences. For such
tasks, it may still be that some processors work faster in 32-bit chunks.
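
As a concrete, purely illustrative sketch (assuming C; the function names
are my own invention): in UTF-8 every byte of a multi-byte sequence has its
high bit set, and in UTF-16 the surrogates occupy 0xD800-0xDFFF, so comparing
raw code units against an ASCII value can never match inside a longer sequence.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch: find the next '<' in UTF-8 text without decoding.
       Lead and continuation bytes of multi-byte sequences all have the high
       bit set, so a plain byte comparison cannot match mid-sequence. */
    static const unsigned char *find_lt_utf8(const unsigned char *p,
                                             const unsigned char *end)
    {
        while (p < end && *p != '<')
            p++;
        return p < end ? p : NULL;
    }

    /* The UTF-16 loop is the same over 16-bit units; surrogates lie in
       0xD800..0xDFFF and cannot collide with ASCII values. */
    static const uint16_t *find_lt_utf16(const uint16_t *p, const uint16_t *end)
    {
        while (p < end && *p != '<')
            p++;
        return p < end ? p : NULL;
    }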

However, many 'inner loop' algorithms, such as copy, can be implemented
using native machine words, handling multiple characters, or parts of
characters, at once, independent of the UTF.

And even in those situations, the savings had better not be
offset by cache limitations.
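
To make the word-at-a-time point concrete, here is a minimal sketch of my
own (it ignores alignment; in practice you would simply call memcpy, which
already works this way):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch: copy encoded text one machine word at a time.
       The loop is identical whether the buffer holds UTF-8, UTF-16, or
       UTF-32 code units; only the number of characters per word differs. */
    static void copy_text(void *dst, const void *src, size_t n_bytes)
    {
        uintptr_t *d = dst;
        const uintptr_t *s = src;
        size_t words = n_bytes / sizeof(uintptr_t);

        for (size_t i = 0; i < words; i++)
            d[i] = s[i];                  /* several code units per step */

        /* copy any remaining tail bytes individually */
        unsigned char *db = (unsigned char *)(d + words);
        const unsigned char *sb = (const unsigned char *)(s + words);
        for (size_t i = 0; i < n_bytes % sizeof(uintptr_t); i++)
            db[i] = sb[i];
    }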

A simplistic model of the 'cost' of UTF-16 relative to UTF-32 would consider:

1) 1 extra test per character (to see whether it's a surrogate)

2) special handling every 100 to 1000 characters (say 10 instructions)

3) additional cost of accessing 16-bit registers (per character)

4) reduction in cache misses (each the equivalent of many instructions)

5) reduction in disk accesses (each the equivalent of many, many instructions)

For many operations, e.g. string length in code units, both 1 and 2 are
no-ops, so you need to apply a reduction factor, say 50%-75%, based on
the mix of operations you actually perform.
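
As a hedged illustration of an operation where items 1 and 2 do apply (the
function below is my own sketch, not from any particular library): iterating
a UTF-16 buffer by code point pays one range test per code unit, plus an
extra step at the occasional lead surrogate; in UTF-32 the same count is
simply the number of code units.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch: count code points in a UTF-16 buffer.
       Item 1: one extra test per code unit (is it a lead surrogate?).
       Item 2: an occasional extra step when a surrogate pair is found. */
    static size_t utf16_code_points(const uint16_t *s, size_t n_units)
    {
        size_t count = 0;
        for (size_t i = 0; i < n_units; i++) {
            if ((s[i] & 0xFC00) == 0xD800 && i + 1 < n_units)
                i++;                     /* skip the trail surrogate */
            count++;
        }
        return count;
    }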

For many processors, item 3 is not an issue.

For items 4 and 5, the multiplier is somewhere in the hundreds or thousands
per occurrence, depending on the architecture. Their relative weight depends
not only on cache sizes, but also on how many other instructions per
character are performed. For text-scanning operations over large data sets,
their cost does predominate.
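
A rough back-of-envelope example, with every figure assumed purely for
illustration: on a machine with 64-byte cache lines, a line holds 16 UTF-32
characters but 32 UTF-16 characters (for mostly-BMP text). A scan that misses
cache throughout therefore pays roughly one extra miss per 32 characters in
UTF-32; at an assumed 100 cycles per miss, that is on the order of 3 cycles
per character, which easily outweighs the fraction of an instruction per
character that the surrogate test in item 1 costs.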

Given this little model and some additional assumptions about your
own project(s), you should be able to determine the 'nicest' UTF for
your own performance-critical case.

A./



