Re: [bitc-dev] Unicode and bitc

Michal Suchanek Sat, 16 Oct 2010 04:43:08 -0700

2010/10/16 Jonathan S. Shapiro <[email protected]>:
> 2010/10/15 Tomasz Gajewski <[email protected]>
>>
>> In Polish (and probably similarly for the languages of other countries in
>> middle and eastern Europe) text is composed mostly of ASCII
>> characters. But we have our special ones: "ąćęłńóśźż" which constitute
>> almost 7% of letters in typical Polish texts and only rarely exist in
>> sequence. So it means that on average every 14'th character requires
>> uint16 encoding.
>
> If this is the case, then it is better from a space perspective to use a
> UCS16 string than a stranded string. The underlying assumption with stranded
> strings is indeed that code points of like size occur in sequence in the
> input text.
>
> Ben: Do you have a sense of what the frequency and distribution is of
> extended code points in typical Chinese text?


If you look at the Chinese manual for your mainboard or hard drive, you
will likely notice that it's mostly Chinese ideograms with an occasional
Latin word or two for technical terms and trademarks, and occasional
strings of "Arabic" numerals to represent numbers.

In less technical texts the Latin characters will be rarer.

>
> Anybody: same question for Japanese text and/or Han?

For Japanese you will get a mix of Kanji (Chinese-like characters) and
Kana, both of which require 16-bit encoding. Latin characters are rarer
because technical terms can be transcribed in Kana. Often the Latin
characters and numerals will be "full width", which are encoded as
separate code points and require 16-bit encoding as well.
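
The point about full-width characters can be checked with a small Python
sketch. It compares an ASCII word with its full-width counterpart and a
Kana word, and shows that NFKC normalization folds the full-width forms
back to ASCII. The sample words and the helper name encoded_sizes are
made up for illustration; only the standard unicodedata module is used.

```python
import unicodedata


def encoded_sizes(text):
    """Report the code point count and the UTF-8 / UTF-16 byte sizes."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),  # no BOM
    }


ascii_word = "USB3"
# Full-width "USB3": U+FF35 U+FF33 U+FF22 U+FF13
fullwidth_word = "\uff35\uff33\uff22\uff13"
# Katakana "memori" (memory): U+30E1 U+30E2 U+30EA
kana_word = "\u30e1\u30e2\u30ea"

for label, word in [("ascii", ascii_word),
                    ("fullwidth", fullwidth_word),
                    ("kana", kana_word)]:
    print(label, encoded_sizes(word))

# Full-width forms carry a compatibility mapping back to ASCII.
print(unicodedata.normalize("NFKC", fullwidth_word) == ascii_word)
```

The full-width word looks like ASCII on screen but cannot share an 8-bit
strand with it unless the text is normalized first, and normalizing would
change what the author typed.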

For Korean you will get something that looks like one of the above
except the characters are different.

Thanks

Michal

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
