2010/10/15 Tomasz Gajewski <[email protected]>
> In Polish (and probably similarly for the languages of other countries in
> middle and eastern Europe) text is composed mostly of ASCII
> characters. But we have our special ones: "ąćęłńóśźż", which constitute
> almost 7% of letters in typical Polish texts and only rarely occur in
> sequence. So it means that on average every 14th character requires
> uint16 encoding.
>
If this is the case, then from a space perspective it is better to use a UCS16 string than a stranded string. The underlying assumption with stranded strings is indeed that code points of like size occur in sequence in the input text.

Ben: Do you have a sense of the frequency and distribution of extended code points in typical Chinese text?

Anybody: same question for Japanese text and/or Han?

As I said at one point earlier, we certainly have the option to store UCS8 characters within a UCS16 strand when doing so is more efficient than assembling adjacent strands. I can see some straightforward heuristics that could handle this sensibly, but doing it optimally requires sophistication and probably isn't worthwhile.

shap
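
As a rough illustration of the space argument above, the following Python sketch compares the two representations on text where every 14th character is wide. It is not BitC's implementation: the 8-byte per-strand overhead (`STRAND_OVERHEAD`), the rule that only ASCII fits a one-byte unit, and all function names are assumptions made for the example.

```python
from itertools import groupby

# Assumed bookkeeping cost per strand in bytes (hypothetical value;
# the real cost depends on how strands are indexed and linked).
STRAND_OVERHEAD = 8


def unit_width(ch: str) -> int:
    """Bytes per code unit: 1 for ASCII, 2 for anything else up to U+FFFF.

    Treating only ASCII as one-byte is an assumption of this sketch.
    """
    if ord(ch) > 0xFFFF:
        raise ValueError("sketch only covers code points up to U+FFFF")
    return 1 if ord(ch) < 0x80 else 2


def strand_widths(text: str) -> list[tuple[int, int]]:
    """Split text into maximal runs of equal width: [(width, length), ...]."""
    return [(w, len(list(run))) for w, run in groupby(text, key=unit_width)]


def stranded_size(text: str, overhead: int = STRAND_OVERHEAD) -> int:
    """Total bytes if each run is stored as its own strand."""
    return sum(overhead + w * n for w, n in strand_widths(text))


def ucs16_size(text: str) -> int:
    """Total bytes if every character takes one 16-bit unit."""
    return 2 * len(text)


# Synthetic Polish-like text: 13 ASCII letters, then one wide letter.
SAMPLE = ("a" * 13 + "ą") * 100

if __name__ == "__main__":
    print("stranded:", stranded_size(SAMPLE))
    print("UCS16:   ", ucs16_size(SAMPLE))
```

Under this model each group of 14 characters costs 13 + 2 + 2h bytes as strands (h being the per-strand overhead) against 28 bytes as UCS16, so the break-even point is h = 6.5. Whether UCS16 actually wins therefore depends on the real per-strand cost, which the sketch only guesses at.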
