On 10/16/10 2:46 AM, Jonathan S. Shapiro wrote:
> Ben: Do you have a sense of what the frequency and distribution is of
> extended code points in typical Chinese text?
>
> Anybody: same question for Japanese text and/or Han?

It depends wildly on the domain of the text. For example, if it's literature (like the instruction manual Michal mentioned) then it'll be mostly native code points, with the occasional word in English and most numbers in ASCII. However, if you are looking at marked-up text, such as web pages being downloaded or served, then about 40~60% of the code points will be ASCII (single-byte in utf8) because of the HTML, CSS, JavaScript, etc., all of which are severely biased towards English and ASCII/utf8. Those numbers apply to standard Japanese and to Mandarin Chinese as of pretty recently.

This is one of the reasons I really like the stranded string approach: it offers a convenient way to store ASCII-heavy code in utf8 while leaving human text in a more appropriate representation, without unduly exposing the ugliness of switching back and forth to the user.

-- 
Live well,
~wren
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
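
To make the domain effect above concrete, here is a minimal Python sketch. It is not from the thread: the function names and the sample strings are made up for illustration. It measures what share of a text's code points are ASCII, and compares the byte counts of the same text in utf8 and UTF-16.

```python
def ascii_share(text: str) -> float:
    """Fraction of code points in `text` that fall in the ASCII range."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) < 0x80) / len(text)


def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text under two encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids counting a BOM
    }


# Hypothetical samples: bare prose versus the same prose wrapped in markup.
prose = "日本語の文章です。"
page = '<p class="body">' + prose + "</p>"

for label, sample in (("prose", prose), ("page", page)):
    print(label, round(ascii_share(sample), 2), encoded_sizes(sample))
```

On samples like these, the bare prose should come out smaller in UTF-16 and the marked-up page smaller in utf8. That is the trade-off a per-strand choice of representation would be trying to exploit, though real proportions depend entirely on the corpus.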
