"Jonathan S. Shapiro" <[email protected]> writes:

> On Wed, Oct 13, 2010 at 5:59 PM, Ben Kloosterman <[email protected]> wrote:

> The implementation I have in mind is as follows. Some of the low-level
> details are being made up as I go along.
>
> 1. Code points that can be encoded in a single UTF-8 byte are
> represented as bytes. Code points that cannot be encoded in a single
> UTF-8 byte but can be encoded in a single UTF-16 unit are encoded as
> uint16. All others are encoded as uint32. This choice, by the way, is
> not innocent; it gives us leave to decide when a run is too short to
> justify switching encodings.
>
> 2. We break a string into a sequence of "strands" such that all
> elements of a given strand have like encodings. The string is then
> represented as a balanced tree of substrings keyed by their starting
> positions.
>
> 3. The division of a string into strands is performed once at
> [de]serialization time, or when literals are handled by the compiler.

I have one remark about this implementation. It plays nicely with texts
that consist mostly of uint16-encoded characters with occasional
byte-encodable ones, because we can always encode those "exceptional"
characters as uint16 without much loss. It does not work as well for
texts that are mostly byte-encodable but contain occasional characters
with a wider encoding.

In Polish (and probably similarly for the languages of other countries
in Central and Eastern Europe), text is composed mostly of ASCII
characters. But we have our special ones, "ąćęłńóśźż", which constitute
almost 7% of the letters in typical Polish text and only rarely occur
in sequence. This means that on average every 14th character requires
uint16 encoding.
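To make the fragmentation concrete, here is a back-of-the-envelope count under a deliberately simplified model (my own; it assumes the diacritics are spread out evenly rather than clustered):

```python
# Model: a 1400-character Polish text where exactly every 14th
# character is a diacritic such as "ó" (needs uint16) and the
# rest is ASCII (fits in one byte).
text = ("a" * 13 + "ó") * 100  # 1400 characters, 100 non-ASCII

def width(cp: int) -> int:
    # 1 byte if it fits in one UTF-8 byte, else 2 (ignoring uint32 here)
    return 1 if cp < 0x80 else 2

# Every change of encoding width starts a new strand.
strands = 1
for prev, cur in zip(text, text[1:]):
    if width(ord(prev)) != width(ord(cur)):
        strands += 1

print(strands)  # 200 strands, i.e. only 7 characters per strand
```

So an isolated diacritic costs two strand boundaries, and the tree ends up with a node roughly every seven characters, which is the overhead I am worried about.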

I don't think it will behave well enough. Maybe an additional encoding
could somehow be allowed for such cases? I don't like this option, but
internally it could be used without big problems, I think.
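Alternatively, perhaps the "run is too short" leeway mentioned in point 1 already covers this: a coalescing pass could fold short strands into a neighbour's wider encoding, since widening is always lossless. A sketch of what I mean (the threshold and all names are made up by me):

```python
MIN_RUN = 8  # assumed threshold; the real heuristic is unspecified

def coalesce(strands, min_run=MIN_RUN):
    """Merge adjacent (width, text) strands whenever either side is
    shorter than min_run, promoting the result to the wider encoding.
    Widening is lossless, so this trades a little space for far fewer
    tree nodes in ASCII-with-occasional-diacritics text."""
    merged = []
    for w, t in strands:
        if merged and (len(t) < min_run or len(merged[-1][1]) < min_run):
            pw, pt = merged[-1]
            merged[-1] = (max(pw, w), pt + t)  # promote to wider width
        else:
            merged.append((w, t))
    return merged
```

With min_run=4, for example, [(2, "xx"), (1, "ab"), (2, "yy")] collapses into the single uint16 strand (2, "xxabyy"), while two long runs of different widths are left alone.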

Regards
Tomasz Gajewski
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
