"Jonathan S. Shapiro" <[email protected]> writes:

> On Wed, Oct 13, 2010 at 5:59 PM, Ben Kloosterman <[email protected]> wrote:
> The implementation I have in mind is as follows. Some of the low-level
> details are being made up as I go along.
>
> 1. Code points that can be encoded in a single UTF-8 byte are
> represented as bytes. Code points that cannot be encoded in a single
> UTF-8 byte but can be encoded in a single UTF-16 unit are encoded as
> uint16. All others are encoded as uint32. This choice, by the way, is
> not innocent; it gives us leave to decide when a run is too short to
> justify switching encodings.
>
> 2. We break a string into a sequence of "strands" such that all
> elements of a given strand have like encodings. The string is then
> represented as a balanced tree of substrings keyed by their starting
> positions.
>
> 3. The division of a string into strands is performed once at
> [de]serialization time, or when literals are handled by the compiler.

I have one remark about this implementation. It plays nicely with text
that consists mostly of characters needing the uint16 encoding, with
occasional byte-encodable ones, because those "exceptional" characters
can always be stored as uint16 without much loss. It does not work as
well for text in which the occasional characters are the ones that need
the wider encoding.

In Polish (and probably similarly for the languages of other central
and eastern European countries), text is composed mostly of ASCII
characters, but we have our special ones, "ąćęłńóśźż", which make up
almost 7% of the letters in typical Polish text and only rarely occur
in sequence. That means that, on average, every 14th character requires
the uint16 encoding, so I think this scheme will not behave well
enough. Maybe an additional encoding could somehow be allowed for such
cases? I don't like this option, but I think it could be used
internally without big problems.

Regards,
Tomasz Gajewski

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
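As an aside, the three-way classification in point 1 and the strand split in point 2 could be sketched roughly like this. This is my own illustration under the assumptions stated in the quoted points, not code from BitC; the names `unit_width` and `strand_split` are invented:

```python
# Sketch of the encoding classification and strand split described
# above. My own illustration, not BitC code.

def unit_width(cp: int) -> int:
    """Width class for a code point: 1, 2, or 4 bytes per element."""
    if cp < 0x80:          # fits in a single UTF-8 byte (ASCII)
        return 1
    if cp < 0x10000:       # fits in a single UTF-16 unit (BMP,
        return 2           # ignoring surrogates for this sketch)
    return 4               # everything else needs a full uint32

def strand_split(text: str):
    """Split text into (width, substring) strands of like encoding."""
    strands = []
    for ch in text:
        w = unit_width(ord(ch))
        if strands and strands[-1][0] == w:
            strands[-1] = (w, strands[-1][1] + ch)
        else:
            strands.append((w, ch))
    return strands
```

For example, `strand_split("abęc")` yields three strands: `[(1, "ab"), (2, "ę"), (1, "c")]`, which is exactly the fragmentation pattern discussed below.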
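To make the fragmentation concern concrete: if roughly every 14th letter needs the wider encoding and such letters rarely occur in sequence, a naive strand split produces about two strands per 14 characters. A quick count on a synthetic sample (my own illustration; `count_strands` is an invented helper, not from the thread):

```python
# Counting strands in a Polish-like sample where one character in
# fourteen falls outside ASCII. My own illustration of the
# fragmentation described above.

def count_strands(text: str) -> int:
    """Number of runs of characters sharing the same width class."""
    widths = [1 if ord(c) < 0x80 else 2 for c in text]
    return sum(1 for i, w in enumerate(widths)
               if i == 0 or widths[i - 1] != w)

# Synthetic sample: 13 ASCII letters followed by one "ę", repeated.
sample = ("a" * 13 + "ę") * 100
print(count_strands(sample), len(sample))   # → 200 1400
```

Two hundred strands for 1400 characters means an average strand of only seven characters, which is where the per-strand tree overhead would start to dominate.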
