On Thu, Oct 14, 2010 at 8:43 PM, Ben Kloosterman <[email protected]> wrote:

> It doesn’t really screw them, for 3 reasons:
> - Western character content is common! I tested a number of Asian web pages
> with native content (Indian, Chinese and Thai) and in all cases UTF-8 was
> about 20-30% smaller than UTF-16.
>
This would imply that only 40% of the content was UCS-2 code points. Does
that pass a sanity check?

And if this is so, then the stranded string implementation would likewise be
20-30% smaller, but would not impose O(n) random indexing cost.
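
For illustration only, here is one way a stranded string could be sketched.
The class name, the choice of 1-, 2- and 4-byte strands, and the use of Python
are all assumptions made for this example, not the BitC design. Each strand
holds a run of code points at one fixed width, so an index lookup is a binary
search over strand start positions followed by a constant-time offset
calculation, rather than a scan over the whole string.

```python
from bisect import bisect_right
from itertools import groupby


def width(ch):
    """Bytes per code point for the strand this character belongs to (assumed scheme)."""
    cp = ord(ch)
    return 1 if cp < 0x100 else 2 if cp < 0x10000 else 4


class StrandedString:
    """Hypothetical stranded string: runs of equal-width code points."""

    def __init__(self, text):
        self.starts = []   # code point index at which each strand begins
        self.strands = []  # (width, packed bytes) for each strand
        pos = 0
        for w, group in groupby(text, key=width):
            cps = [ord(c) for c in group]
            data = b"".join(cp.to_bytes(w, "little") for cp in cps)
            self.starts.append(pos)
            self.strands.append((w, data))
            pos += len(cps)
        self.length = pos

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search for the strand containing code point i.
        s = bisect_right(self.starts, i) - 1
        w, data = self.strands[s]
        off = (i - self.starts[s]) * w
        return chr(int.from_bytes(data[off:off + w], "little"))

    def byte_size(self):
        """Total payload bytes, ignoring the per-strand bookkeeping."""
        return sum(len(data) for _, data in self.strands)
```

Note that lookup here is logarithmic in the number of strands, and the
per-strand overhead is not counted in byte_size(), so text that alternates
widths frequently would do worse than this sketch suggests.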


> - UTF-8 stores nearly all the common UTF-16 2-byte chars in at worst 3
> bytes, and the common BMP chars often still take 2 bytes.
>
But encoding density isn't the main issue. Indexing performance is.
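
To make the indexing point concrete, here is a minimal sketch (function names
are made up for this example, and the input is assumed to be valid UTF-8).
Finding the n-th code point in UTF-8 means walking the bytes from the start,
because each code point takes 1 to 4 bytes. With a fixed-width encoding such
as UCS-2 the byte offset can be computed directly.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 data by scanning: O(n)."""
    i = 0
    count = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            size = 1
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        else:
            size = 4
        if count == n:
            return data[i:i + size].decode("utf-8")
        i += size
        count += 1
    raise IndexError(n)


def ucs2_index(data: bytes, n: int) -> str:
    """Return the n-th code point of little-endian UCS-2 data: O(1).

    Only correct for text with no code points outside the BMP.
    """
    if not 0 <= 2 * n < len(data):
        raise IndexError(n)
    return data[2 * n:2 * n + 2].decode("utf-16-le")
```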