On 10/16/10 2:46 AM, Jonathan S. Shapiro wrote:
> Ben: Do you have a sense of what the frequency and distribution is of
> extended code points in typical Chinese text?
>
> Anybody: same question for Japanese text and/or Han?

It depends wildly on the domain of the text. For example, if it's
literature (like the instruction manual Michal mentioned), then it'll be
mostly native code points, with the occasional word in English and most
numbers in ASCII. However, if you are looking at marked-up text, such as
when downloading (or serving) web pages, then about 40~60% of the code
points will be ASCII (single-byte in utf8), because the HTML, CSS,
JavaScript, etc. are all severely biased towards English and ASCII/utf8.
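
For anyone who wants to check the ratio on their own corpus, here is a
minimal sketch in Python. The function name ascii_fraction and the
sample string are made up for illustration; nothing here comes from a
real measurement.

```python
def ascii_fraction(text: str) -> float:
    """Return the share of code points in `text` that are in the ASCII range."""
    if not text:
        return 0.0
    # Python 3 strings are sequences of code points, so len() counts code points.
    ascii_count = sum(1 for ch in text if ord(ch) < 0x80)
    return ascii_count / len(text)


# Hypothetical sample: a tiny marked-up fragment with Japanese body text.
sample = '<p class="note">日本語のテキスト</p>'
print(f"{ascii_fraction(sample):.0%} of code points are ASCII")
```

Running this over whole pages (markup, scripts and all) versus the
extracted body text alone should show the two regimes described above.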

Those numbers apply to standard Japanese and to Mandarin Chinese as of
pretty recently. This is one of the reasons I really like the stranded
string approach: it offers a convenient way to store utf8-biased code in
utf8 while leaving human text in a more appropriate representation,
without unduly exposing the ugliness of switching back and forth to the
user.
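
As a rough illustration of the idea (not a definitive design), the
Python sketch below assumes a strand is just a run of text tagged with
its own encoding. The names Strand, to_strands and from_strands are
hypothetical, and so is the policy of putting ASCII runs in utf8 and
everything else in UTF-16.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Strand:
    encoding: str  # codec name used for this run (illustrative choice)
    data: bytes    # the run's code points, encoded with `encoding`


def to_strands(text: str) -> List[Strand]:
    """Split text into maximal ASCII / non-ASCII runs, each stored in its own encoding."""
    strands = []
    for is_ascii, run in groupby(text, key=lambda ch: ord(ch) < 0x80):
        encoding = "utf-8" if is_ascii else "utf-16-le"
        strands.append(Strand(encoding, "".join(run).encode(encoding)))
    return strands


def from_strands(strands: List[Strand]) -> str:
    """Reassemble the text; callers never see the encoding switches."""
    return "".join(s.data.decode(s.encoding) for s in strands)


page = "<p>日本語</p>"
strands = to_strands(page)
assert from_strands(strands) == page
```

Splitting on every ASCII character is deliberately naive. A real
implementation would presumably want coarser strands, so that a stray
digit or space inside human text does not force a switch.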

-- 
Live well,
~wren