On Thu, Oct 14, 2010 at 8:43 PM, Ben Kloosterman <[email protected]> wrote:
> It doesn’t really screw them for 3 reasons
> - western char content is common ! I tested a number of asian web pages
> with native content ( Indian , Chinese and Thai) and in all cases UTF-8 was
> about 20-30% smaller than UTF-16.
>

This would imply that only 40% of the content was UCS2 code points. Does that sanity check? And if this is so, then the stranded string implementation would likewise be 20-30% smaller, but would not impose O(n) random indexing cost.

> - UTF-8 stores nearly all the common UTF-16 2 bytes chars In at worst 3
> bytes and the common BP chars often still take 2 bytes.
>

But encoding density isn't the main issue. Indexing performance is.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
