Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Thu, 14 Oct 2010 21:13:16 -0700

>
 >On 10/14/10 10:09 PM, Ben Kloosterman wrote:
 >> On 10/14/10 3:51 PM, Jonathan S. Shapiro wrote:
 >>> In further practice, the number of strands tends to be small, so the
 >>> difference between O(log n) and O(1) is negligible.
 >>
 >> I'm not sure this is true. For example, in all languages "<", ">", the
 >> point and numbers are ASCII. In Chinese, a Y-M-D date mixes Chinese and
 >> ASCII numerals; in fact, in nearly all languages you have UCS-2 codes
 >> interspersed with ASCII numbers and punctuation. So you would need some
 >> sort of complex encoding such that sequences of length < n stay in the
 >> higher encoding form. This is also good because short strings would not
 >> need a tree and hence incur no cost.
 >
 >This is definitely an issue with the proposal. But if it can be
 >surmounted, I think the stranded-string proposal is a nice one---
 >certainly better than settling on any particular utf-N for everything.
 >UTF-8 works well for European languages and about half the content of
 >Asian languages, but that doesn't convince me that the other half of
 >Asian languages should get screwed, or that utf-8 is the best internal
 >representation in the world.


It doesn't really screw them, for 3 reasons:
- Western character content is common! I tested a number of Asian web pages
with native content (Indian, Chinese and Thai), and in all cases UTF-8 was
about 20-30% smaller than UTF-16.
- Asian strings are naturally shorter: first, middle and last name combined
in China come to 3-4 characters!
- UTF-8 stores nearly all the common UTF-16 2-byte chars in at worst 3 bytes,
and the common BMP chars often still take 2 bytes.
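
To make the size argument concrete, here is a rough sketch in Python of the
kind of comparison I mean. The sample strings and the helper name
encoded_sizes are made up for illustration; they are not the pages I
actually tested, so treat the numbers as examples only.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples: ASCII markup, a mixed Chinese date, a short
# all-Chinese name, and a single Latin-1 Supplement character (U+00E9).
samples = {
    "ascii markup": '<a href="/news/2010-10-14">',
    "chinese date": "2010年10月14日",
    "chinese name": "王小明",
    "e with acute": "\u00e9",
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```

The mixed date comes out at 17 bytes in UTF-8 against 22 in UTF-16, because
the ASCII digits cost 1 byte each instead of 2. The pure-Chinese name is the
bad case for UTF-8 (9 bytes against 6), but it is only 3 characters long.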

Ben


_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
