>On 10/14/10 10:09 PM, Ben Kloosterman wrote:
>> On 10/14/10 3:51 PM, Jonathan S. Shapiro wrote:
>>> In further practice, the number of strands tends to be small, so the
>>> difference between O(log n) and O(1) is negligible.
>>
>> I'm not sure this is true. For example, in all languages "<", ">", point
>> and numbers are ASCII. In Chinese, Y-M-D is mixed Chinese and ASCII
>> numerics; in fact, in nearly all languages you have UCS-2 codes with
>> interspersed ASCII numbers and punctuation. So you would need some sort
>> of complex encoding such that sequences shorter than length n stay in
>> the higher encoding form. This is also good because short strings would
>> not need a tree and hence incur no cost.
>
>This is definitely an issue with the proposal. But if it can be
>surmounted, I think the stranded-string proposal is a nice one---
>certainly better than settling on any particular utf-N for everything.
>UTF-8 works well for European languages and about half the content of
>Asian languages, but that doesn't convince me that the other half of
>Asian languages should get screwed, or that utf-8 is the best internal
>representation in the world.

It doesn't really screw them, for 3 reasons:

- Western character content is common! I tested a number of Asian web
  pages with native content (Indian, Chinese and Thai), and in all cases
  UTF-8 was about 20-30% smaller than UTF-16.
- Asian strings are naturally shorter: first, middle and last name
  combined in China is 3-4 characters!
- UTF-8 stores nearly all the common UTF-16 2-byte chars in at worst
  3 bytes, and the common BMP chars often still take 2 bytes.

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
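
The size comparison above can be spot-checked with a few lines of Python. This is only a rough sketch: `encoded_sizes` is a hypothetical helper, and the sample strings (a Chinese Y-M-D date, a three-character name, and the same name wrapped in markup) are made up for illustration. They are not the web pages that were actually tested.

```python
# Hypothetical helper: compare the storage cost of one string in two encodings.
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),  # "-le" avoids the 2-byte BOM
    }

# Illustrative samples only (assumptions, not measured page content).
samples = {
    "date (ASCII digits + CJK)": "2010年10月14日",
    "name (pure CJK)": "王小明",
    "markup around CJK": "<p>王小明</p>",
}

for label, text in samples.items():
    sizes = encoded_sizes(text)
    print(f"{label}: utf-8={sizes['utf8']} bytes, utf-16={sizes['utf16']} bytes")
```

For these samples, the pure CJK name costs 9 bytes in UTF-8 against 6 in UTF-16 (the 3-byte worst case), while the date (17 vs 22) and the marked-up name (16 vs 20) come out smaller in UTF-8 because of the interspersed ASCII. How the balance falls on real content depends on the mix of ASCII and non-ASCII characters in it.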
