Re: [bitc-dev] Unicode and bitc

Jonathan S. Shapiro Sat, 16 Oct 2010 13:23:04 -0700

On Sat, Oct 16, 2010 at 1:21 AM, Ben Kloosterman <[email protected]> wrote:


>  1)      For adding strings, isn’t it an issue if you add, say, 1 UCS-1,
> then 1 UCS-4, then 1 UCS-1, due to the “tree” overhead? So you will need
> to parse and re-encode? Do you create strings from chars as UCS-4 in a
> string builder, or will the string builder change to UCS-4 when needed?
>

Last night I thought so, but then I realized that a smarter encoding is
possible. The strands are contiguous series of code units. The string object
itself could be just a contiguous series of (offset, strand pointer) pairs.
The code unit size of the strand is encoded somewhere in the pair. No actual
tree is required. You binary-search the offsets and then index into the
strand you found. If you maintain a current offset marker within the string
representation, you'll get O(1) most of the time.
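
A minimal sketch of that layout, in Python purely for illustration (BitC's
runtime would look nothing like this). The names Strand, StrandString, and
cursor are hypothetical, and the rule for picking a strand's code unit size
is an assumption; the post only says the size is recorded in the pair
somewhere.

```python
from array import array
from bisect import bisect_right


class Strand:
    """A contiguous run of code units that all share one width."""

    def __init__(self, text):
        top = max(map(ord, text))
        # Assumed rule: use the narrowest unit that holds every character.
        self.width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
        # "B", "H", "L" are at least 1, 2, 4 bytes; "L" may be wider on
        # some platforms, which does not matter for this sketch.
        typecode = {1: "B", 2: "H", 4: "L"}[self.width]
        self.units = array(typecode, map(ord, text))


class StrandString:
    """A flat table of (offset, strand) pairs; no tree."""

    def __init__(self, *pieces):
        self.starts = []    # character offset at which each strand begins
        self.strands = []   # parallel to self.starts
        self.length = 0
        for piece in pieces:
            if piece:       # skip empty pieces so every strand is non-empty
                self.starts.append(self.length)
                self.strands.append(Strand(piece))
                self.length += len(piece)
        self.cursor = 0     # index of the most recently used strand

    def __len__(self):
        return self.length

    def _find(self, i):
        # Fast path: the strand used last time still covers index i.
        k = self.cursor
        start = self.starts[k]
        if start <= i < start + len(self.strands[k].units):
            return k
        # Slow path: binary search over the starting offsets.
        k = bisect_right(self.starts, i) - 1
        self.cursor = k
        return k

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        k = self._find(i)
        return chr(self.strands[k].units[i - self.starts[k]])
```

Appending a UCS-1 piece, then a UCS-4 piece, then another UCS-1 piece just
adds three entries to the table; nothing is parsed or re-encoded.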


> 2)      Let me confirm this: each strand is a separate heap object,
> right?
>
It lives in the heap, but it is not a first-class object.


>  ·         UTF-8 with char indexes: O(n) scans, so not suitable for
> medium to long strings. Risk on medium to long stream performance.
>
Unless you keep an associated side table of (offset, run-size) pairs.
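
One possible reading of that side table, sketched in Python for illustration
only. The assumption here is that a "run" is a maximal stretch of characters
whose UTF-8 encodings have the same byte length, so that inside a run the
byte position is simple arithmetic. IndexedUtf8 and utf8_len are
hypothetical names.

```python
from bisect import bisect_right


def utf8_len(lead):
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


class IndexedUtf8:
    """UTF-8 bytes plus a side table of equal-width runs."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.char_starts = []   # character offset where each run begins
        self.runs = []          # parallel: (byte offset, bytes per char)
        char_pos = byte_pos = 0
        prev_width = 0
        while byte_pos < len(self.data):
            width = utf8_len(self.data[byte_pos])
            if width != prev_width:
                # Encoded length changed, so a new run starts here.
                self.char_starts.append(char_pos)
                self.runs.append((byte_pos, width))
                prev_width = width
            char_pos += 1
            byte_pos += width
        self.length = char_pos

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search over runs, then fixed-stride arithmetic inside one.
        k = bisect_right(self.char_starts, i) - 1
        byte_start, width = self.runs[k]
        at = byte_start + (i - self.char_starts[k]) * width
        return self.data[at:at + width].decode("utf-8")
```

Lookup cost is logarithmic in the number of runs rather than linear in the
string length. Text that switches encoded width at nearly every character
would make the table about as large as the string, so this helps most when
runs are long.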


>  ·         UTF-8 with binary index: O(1) lookups. Issues with backward
> compatible API / algorithms, and developers needing to understand that it
> is bytes. Not great for long strings due to the large memory object.
>
Not sure why this creates a compatibility issue, since the index is internal
to the runtime layer's implementation.

> Risk on backward compatibility and developer incorrect usage
>
> o   Use operator overloading on +; this will require a typedef of the index
> type and could remove some of the compatibility issues.
>
> o   Value type option. Will still require a reference type string, e.g.
> longstring, so it can only be considered if we do that. Useful in cases
> where you don’t have/want a heap.
>
As I have said elsewhere, I strongly oppose a LongString/ShortString
distinction. It cannot be used correctly.


>  ·         Strand use: benefits for large strings and GC; some concern
> over small string overhead and how it will work in practice, e.g. when
> aggregating mixed strings, and how to create strings efficiently. Risk is
> for small string performance and possibly creating strings.
>

I can think of several other options. The point is that all of these are
matters purely internal to the runtime. How the string is implemented
internally isn't part of the specification, and must not be.


shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
