2010/10/15 Tomasz Gajewski <[email protected]>

> In Polish (and probably similarly for the languages of other countries
> in middle and eastern Europe) text is composed mostly of ASCII
> characters. But we have our special ones: "ąćęłńóśźż" which constitute
> almost 7% of letters in typical Polish texts and only rarely exist in
> sequence. So it means that on average every 14th character requires
> uint16 encoding.
>

If this is the case, then it is better from a space perspective to use a
UCS16 string than a stranded string. The underlying assumption with stranded
strings is indeed that code points of like size occur in sequence in the
input text.
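
To make the space argument concrete, here is a rough Python sketch that
compares the two layouts. Everything in it is an assumption made for
illustration: the per-strand bookkeeping cost (STRAND_OVERHEAD), the
function names, and the rule that a code point below U+0100 counts as UCS8.
None of it is taken from the actual BitC design.

```python
from itertools import groupby

# Assumed bytes of bookkeeping per strand (length, pointer, ...).
# This figure is hypothetical, chosen only to illustrate the trade-off.
STRAND_OVERHEAD = 8


def width(ch):
    """Bytes needed for one code point: 1 for UCS8, 2 for UCS16."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    raise ValueError("code points above U+FFFF are outside this sketch")


def flat_ucs16_size(text):
    """Size when every code point is stored in 2 bytes in a single strand."""
    for ch in text:
        width(ch)  # reject code points this sketch does not model
    return STRAND_OVERHEAD + 2 * len(text)


def stranded_size(text):
    """Size when each maximal run of equal-width code points is a strand."""
    total = 0
    for w, run in groupby(text, key=width):
        total += STRAND_OVERHEAD + w * sum(1 for _ in run)
    return total


if __name__ == "__main__":
    for sample in ("hello", "zażółć gęślą jaźń"):
        print(sample, stranded_size(sample), flat_ucs16_size(sample))
```

Under these assumptions the Polish pangram breaks into ten short strands
and costs 105 bytes stranded versus 42 bytes as a single UCS16 string,
while pure ASCII text is smaller when stranded.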

Ben: Do you have a sense of what the frequency and distribution of
extended code points are in typical Chinese text?

Anybody: same question for Japanese text and/or Han?

As I said at one point earlier, we certainly have the option to store UCS8
characters within a UCS16 strand when doing so is more efficient than
assembling adjacent strands. I can see some straightforward heuristics that
could handle this sensibly, but doing it optimally requires sophistication
and probably isn't worthwhile.
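
One possible heuristic of the straightforward kind, sketched in Python. It
is a greedy left-to-right pass, not an optimal one, and the names, the
STRAND_OVERHEAD figure, and the UCS8 cutoff at U+0100 are all assumptions
for illustration rather than anything from the BitC implementation. A short
UCS8 run is widened into a neighboring UCS16 strand whenever the extra
bytes cost less than the strand overhead it saves.

```python
from itertools import groupby

# Assumed bytes of bookkeeping per strand; hypothetical figure.
STRAND_OVERHEAD = 8


def width(ch):
    """Bytes needed for one code point: 1 for UCS8, 2 for UCS16."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    raise ValueError("code points above U+FFFF are outside this sketch")


def greedy_strands(text):
    """Return [width, length] strands after a greedy merging pass."""
    strands = []
    for w, run in groupby(text, key=width):
        n = sum(1 for _ in run)
        last = strands[-1] if strands else None
        if last is not None and last[0] == w:
            # Same width as the current strand: just extend it.
            last[1] += n
        elif last is not None and w == 1 and n < STRAND_OVERHEAD:
            # Short UCS8 run after a UCS16 strand: widening costs n bytes,
            # which is less than the overhead of starting a new strand.
            last[1] += n
        elif last is not None and w == 2 and last[1] < STRAND_OVERHEAD:
            # Short UCS8 strand followed by UCS16 text: widen the strand.
            last[0] = 2
            last[1] += n
        else:
            strands.append([w, n])
    return strands


def total_size(strands):
    """Total bytes for a list of [width, length] strands."""
    return sum(STRAND_OVERHEAD + w * n for w, n in strands)


if __name__ == "__main__":
    print(greedy_strands("zażółć gęślą jaźń"))
    print(greedy_strands("a" * 20 + "ą" + "b" * 20))
```

With these assumptions the Polish pangram collapses into one UCS16 strand,
while a lone "ą" between two long ASCII runs stays in its own strand. The
pass is deliberately not optimal: a UCS8 run of 8 to 15 characters sitting
between two UCS16 strands would be worth merging, since that saves two
strand overheads, but this greedy rule leaves it alone.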


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
