The worst case for strands would be that it's all encoded as UCS-2 (if the
algorithm is tuned that way), which means you're no worse off than Java, C#,
or C with wide chars, and you don't need to deal with code pages. That said,
it's likely the encoding will do better, with runs of ASCII chars (XML and
HTML tags, for example) and the occasional UCS-2 sequence.
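
To make the comparison concrete, here is a rough Python sketch of the
strand split described in the quoted message below, with a payload size
comparison against the "everything at the widest unit" worst case. The
function names are made up for illustration, the balanced tree is left out,
and no per-strand overhead is counted, so treat it as a sketch under those
assumptions rather than the proposed implementation.

```python
# Hypothetical sketch: classify code points by storage unit, group them into
# strands of like encoding, and compare payload sizes. Names are illustrative.

def unit_width(ch):
    """Bytes per element: 1 for ASCII, 2 for the rest of the BMP, 4 otherwise."""
    cp = ord(ch)
    if cp < 0x80:
        return 1
    if cp < 0x10000:
        return 2
    return 4


def split_into_strands(text):
    """Return a list of (width, substring) runs with like encodings."""
    strands = []
    for ch in text:
        w = unit_width(ch)
        if strands and strands[-1][0] == w:
            strands[-1][1].append(ch)      # extend the current run
        else:
            strands.append((w, [ch]))      # start a new run
    return [(w, "".join(chars)) for w, chars in strands]


def payload_bytes(strands):
    """Bytes of character data only; tree nodes and headers are not counted."""
    return sum(w * len(s) for w, s in strands)


def widest_only_bytes(text):
    """Size if the whole string is stored at its widest required unit."""
    if not text:
        return 0
    return max(unit_width(ch) for ch in text) * len(text)


if __name__ == "__main__":
    sample = "a\u0105b"                    # 'a', U+0105, 'b'
    strands = split_into_strands(sample)
    print(strands)
    print(payload_bytes(strands), widest_only_bytes(sample))
```

The sketch never merges short runs. A tuned version would fold a run into
its wider neighbour when the run is too short to justify a switch, and the
threshold for that depends on the per-strand overhead, which is not modelled
here.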


Ben 

 >-----Original Message-----
 >From: Tomasz Gajewski [mailto:[email protected]]
 >Sent: Saturday, October 16, 2010 3:47 AM
 >To: Discussions about the BitC language
 >Cc: [email protected]
 >Subject: Re: [bitc-dev] Unicode and bitc
 >
 >"Jonathan S. Shapiro" <[email protected]> writes:
 >
 >> On Wed, Oct 13, 2010 at 5:59 PM, Ben Kloosterman <[email protected]>
 >wrote:
 >
 >> The implementation I have in mind is as follows. Some of the low-level
 >> details are being made up as I go along.
 >>
 >> 1. Code points that can be encoded in a single UTF-8 byte are
 >> represented as bytes. Code points that cannot be encoded in a single UTF-8 byte
 >> but can be encoded in a single utf-16 unit are encoded as uint16. All
 >> others are encoded as uint32. This choice, by the way, is not
 >> innocent; it gives us leave to decide when a run is too short to
 >> justify switching encodings.
 >>
 >> 2. We break a string into a sequence of "strands" such that all
 >> elements of a given strand have like encodings. The string is then
 >> represented as a balanced tree of substrings keyed by their starting
 >> positions.
 >>
 >> 3. The division of a string into strands is performed once at
 >> [de]serialization time, or when literals are handled by the compiler.
 >
 >I have one remark about this implementation. It plays nicely with
 >cases where text consists mostly of characters with uint16 encoding
 >and occasional byte-encoded ones, because we can always encode those
 >"exceptional" ones as uint16 without much loss. It is not so with texts
 >that have occasional characters requiring a wider encoding.
 >
 >In Polish (and probably similarly in the languages of other countries in
 >central and eastern Europe) text is composed mostly of ASCII
 >characters. But we have our special ones: "ąćęłńóśźż", which constitute
 >almost 7% of letters in typical Polish texts and only rarely occur in
 >sequence. So on average every 14th character requires
 >uint16 encoding.
 >
 >I think the scheme will not behave well enough here. Maybe an additional
 >encoding could somehow be allowed for such cases? I don't like this option,
 >but internally it could be used without big problems, I think.
 >
 >Regards
 >Tomasz Gajewski


_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
