On 16 October 2010 22:13, Jonathan S. Shapiro <[email protected]> wrote:
> On Sat, Oct 16, 2010 at 4:41 AM, Michal Suchanek <[email protected]>
> wrote:
>>
>> 2010/10/16 Jonathan S. Shapiro <[email protected]>:
>> > Ben: Do you have a sense of what the frequency and distribution is of
>> > extended code points in typical Chinese text?
>>
>> If you look at Chinese manual for your mainboard or hardrive you will
>> likely notice that it's mostly Chinese ideograms with occasional Latin
>> word or two for technical terms and trademarks and occasional strings
>> of "arabic" numerals to represent a number.
>
> Yes. But you suggested that the most common 200,000 were within the UCS16
> space, so it wasn't clear to me how many of those ideogram runs might be
> UCS16-encodable.
Most of them, AFAIK. You would have to look at some electronic Chinese
text to be sure, but it was possible to write Chinese with those broken
16-bit Unicode implementations; only languages added later require more
than 16 bits. What may require 32 bits are those "shape selectors" or new
characters, but those would be rare, I would think. The Han unification
was probably meant to fit all of CJK into 16 bits.

As an experiment I tried to extract codepoints > 0xD800 from the HTML at
http://tw.asus.com/product.aspx?P_ID=uZV35rNTt6bmvFUX

I got these strings, which are all full-width punctuation:

[",,,(),,,,,,,,,,,,,,,,,,,", "", "!,,,,,,,,,,,,?", "!,,,,,,,,,,,,?", ",,,,!,,,,,,,!,,,,,,,!,,,,,,,,,,,,?"]

These are probably the ones in the 0xFF00 block. There is also a comma in
the CJK punctuation in the 0x3000 block, so the full-width comma is
technically bogus, and the same applies to the parens. Note that the full
stop 。 is from CJK punctuation, not full-width Latin.

> Is there a syllabic script (something similar to a kana) in China?

AFAIK there is not, and there cannot be, because it would not make sense
for Chinese.

Thanks

Michal
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
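
For anyone who wants to repeat the extraction experiment described above, here is a minimal sketch in Python 3. The message does not say which tool was actually used, so this is only one possible way to do it. The names high_runs, block_of and sample are hypothetical, the sample string is made up for illustration, and fetching and decoding the HTML page is left out. The 0xD800 cut-off and the 0x3000 and 0xFF00 blocks are the ones mentioned in the message.

```python
from itertools import groupby

THRESHOLD = 0xD800  # cut-off taken from the message


def high_runs(text, threshold=THRESHOLD):
    """Return runs of consecutive characters whose code point is above threshold.

    Assumes a Python 3 str, where each element is one full code point.
    """
    runs = []
    for is_high, group in groupby(text, key=lambda ch: ord(ch) > threshold):
        if is_high:
            runs.append("".join(group))
    return runs


def block_of(ch):
    """Rough block classification, covering only the ranges discussed here."""
    cp = ord(ch)
    if 0x3000 <= cp <= 0x303F:
        return "CJK Symbols and Punctuation"
    if 0x4E00 <= cp <= 0x9FFF:
        return "CJK Unified Ideographs"
    if 0xFF00 <= cp <= 0xFFEF:
        return "Halfwidth and Fullwidth Forms"
    if cp > 0xFFFF:
        return "supplementary plane"
    return "other"


# Made-up sample: ideographs, full-width punctuation, ASCII, and a CJK full stop.
sample = "主機板，支援（USB）。快！"

for run in high_runs(sample):
    print(repr(run), [block_of(ch) for ch in run])
```

On this sample only the full-width comma, parentheses and exclamation mark are reported; the ideographs and the full stop 。 sit below 0xD800 and are skipped, which matches the observation in the message.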
