2010/10/16 Jonathan S. Shapiro <[email protected]>:
> 2010/10/15 Tomasz Gajewski <[email protected]>
>>
>> In Polish (and probably similarly for the languages of other countries
>> in central and eastern Europe) text is composed mostly of ASCII
>> characters. But we have our special ones: "ąćęłńóśźż", which constitute
>> almost 7% of letters in typical Polish texts and only rarely occur in
>> sequence. So it means that on average every 14th character requires
>> uint16 encoding.
>
> If this is the case, then it is better from a space perspective to use a
> UCS16 string than a stranded string. The underlying assumption with stranded
> strings is indeed that code points of like size occur in sequence in the
> input text.
>
> Ben: Do you have a sense of what the frequency and distribution is of
> extended code points in typical Chinese text?

If you look at a Chinese manual for your mainboard or hard drive, you will
likely notice that it is mostly Chinese ideograms, with an occasional Latin
word or two for technical terms and trademarks, and occasional strings of
"Arabic" numerals to represent a number. In less technical texts the
Latin characters will be rarer.

>
> Anybody: same question for Japanese text and/or Han?

For Japanese you will get a mix of Kanji (Chinese-like characters) and
Kana, both of which require 16-bit encoding. The Latin characters are
rarer because the technical terms can be transcribed in Kana. Often the
Latin characters and numerals will be "full width" forms, which have their
own code points and require 16-bit encoding as well.

For Korean you will get something that looks like one of the above,
except that the characters are different.

Thanks

Michal
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
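To put rough numbers on the Polish case quoted at the top of this message, here is a minimal Python sketch that splits a text into strands (maximal runs of code points needing the same unit width) and compares the total size with a fixed 16-bit encoding. Everything in it is an assumption made for illustration, not BitC's actual representation: the names `unit_width`, `strands`, `stranded_size` and `fixed16_size` are hypothetical, 1-byte strands are taken to hold ASCII only (which matches Tomasz counting "ó" as needing uint16), and the 8-byte per-strand bookkeeping cost in `STRAND_OVERHEAD` is a guess, since the thread gives no figure.

```python
from itertools import groupby

# Hypothetical per-strand bookkeeping cost in bytes (e.g. an offset plus a
# width tag). The thread does not give a value; 8 is purely an assumption.
STRAND_OVERHEAD = 8


def unit_width(ch: str) -> int:
    """Smallest unit width in bytes for one code point (assumed 1/2/4 scheme)."""
    cp = ord(ch)
    if cp < 0x80:       # ASCII fits a 1-byte strand (assumption: ASCII only)
        return 1
    if cp < 0x10000:    # rest of the Basic Multilingual Plane: 2 bytes
        return 2
    return 4            # supplementary planes: 4 bytes


def strands(text: str) -> list[tuple[int, int]]:
    """Split text into maximal runs of equal unit width: (width, length) pairs."""
    return [(w, len(list(run))) for w, run in groupby(text, key=unit_width)]


def stranded_size(text: str, overhead: int = STRAND_OVERHEAD) -> int:
    """Payload bytes plus the assumed fixed overhead for every strand."""
    return sum(w * n + overhead for w, n in strands(text))


def fixed16_size(text: str) -> int:
    """Bytes needed with 16-bit code units (UTF-16 without a byte order mark)."""
    return len(text.encode("utf-16-le"))


if __name__ == "__main__":
    # Synthetic stand-in for "every 14th character needs uint16":
    # 13 ASCII letters followed by one isolated Polish letter, repeated.
    sample = ("a" * 13 + "ą") * 100
    print(len(strands(sample)), stranded_size(sample), fixed16_size(sample))
```

For this synthetic sample the text breaks into 200 strands, and with the assumed 8-byte overhead the stranded form takes 3100 bytes against 2800 bytes for fixed 16-bit units. With zero overhead the stranded figure drops to 1500 bytes, so the comparison depends entirely on the real per-strand cost, which is exactly why isolated extended characters are the bad case for stranding.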
