Re: [bitc-dev] Unicode and bitc

Ben Kloosterman Sun, 17 Oct 2010 06:07:09 -0700

>Last night I thought so, but then I realized that a smarter encoding is
possible. The strands are contiguous series of code units. The string object
itself could be just a contiguous series of (offset, strand pointer) pairs.
The code unit size of the strand is encoded in the pair somewhere. No actual
tree is required. You bsearch the offsets and then index into the discovered
strand. If you maintain a current offset marker within the string
representation, you'll get O(1) most of the time

Why does the strand size need code size encoded ? Doesn't an array have a
size field ? 

As an alternative you can make a strand as  either of these 2 types ( all in
UCS-1 , 2 or 4) 

-          [Type bits] [size] [Data] [Next Strand] or [left Strand] [right
Strand]  [size][Data]  or even Node or leaf

-          [Type bits] [size][Data] 

The first strand is embedded in the string object so  for small strings 

-          String  is just a array 

-          There is no pointer memory overhead 

-          You can cast to array for interop  ( with the check on type) 

-          And importantly when creating small strings you only create a
single object ,

For longer strings you have the NextStrand so you have a list ( or tree if
desired) 

-          You can walk the list via size(of) getting you to the index
quickly.

-          Small strings don't incur large string costs  

-          Large strings can get benefits ( including line walking  , more
complex index in the lib etc) 

-          Large string design optimized for large string work with the cost
moved down to the strand 

-          Costs 1 bit to see strand is tree/list or simple array , maybe if
00 then array.

One thing that bothers me with Stranded string is the strands appear to me
to be the immutable strings..and string a larger structure that would be
mutable to say work with lines  , insertions etc.  I know the reasons but it
seems strange to have say such a flexible structure in the lib whether trees
or linked lists  yet be immutable. Maybe there is a case where they can be
mutable or immutable  upon construction but strands are always immutable. 

> As I have said elsewhere, I strongly oppose a LongString/ShortString
distinction. It cannot be used correctly. 

I know  you have stated this  but  while  I have not thought about it much
I do know C++ has Cstring and char[]  supported  etc  and we may be able to
do something clever with inheritance ( but instead of  aVMT we just hook in
if string Type A do this if B do that) 

>>UTF-8 with binary index  ,O(1) lookups ,  Issues with backward compatible
API / algorithms  and developers needing to understand its bytes. Not great
>>for long strings due to  large memory object. 

>Not sure why this creates a compatibility issue, since the index is
internal to the runtime layer's implementation. 

Because your exposing byte offset  indexes to programmers who may expect
character indexes. 

From: [email protected] [mailto:[email protected]] On
Behalf Of Jonathan S. Shapiro
Sent: Sunday, October 17, 2010 4:22 AM
To: [email protected]; Discussions about the BitC language
Subject: Re: [bitc-dev] Unicode and bitc

On Sat, Oct 16, 2010 at 1:21 AM, Ben Kloosterman <[email protected]> wrote:

1)      For adding strings isn't it an issue if you add say 1 UCS-1    then
1 UCS-4 then 1 UCS-1   due to the "tree  overhead ? So you will need to
parse and re-encode ? Do you create strings from chars as UCS-4  in a a
string builder  or will string builder change to UCS-4 when needed ?

Last night I thought so, but then I realized that a smarter encoding is
possible. The strands are contiguous series of code units. The string object
itself could be just a contiguous series of (offset, strand pointer) pairs.
The code unit size of the strand is encoded in the pair somewhere. No actual
tree is required. You bsearch the offsets and then index into the discovered
strand. If you maintain a current offset marker within the string
representation, you'll get O(1) most of the time.

2)      Let me confirm that this ,each strand is a separate heap object
right ? 

It lives in the heap, but it is not a first-class object.

 .         UTF-8 with char indexes  , O(n) scans so not suitable for medium
to long strings.  Risk on Medium to long stream performance , 

Unless you keep an associated side table of (offset, run-size) pairs.

.         UTF-8 with binary index  ,O(1) lookups ,  Issues with backward
compatible API / algorithms  and developers needing to understand its bytes.
Not great for long strings due to  large memory object. 

Not sure why this creates a compatibility issue, since the index is internal
to the runtime layer's implementation. 

Risk on backward compatibility and developer incorrect usage

o   Use operator overloading  on +  , will require a typedef of the index
type and could remove some of the compatibility issues.

o   value type option . Will still require a reference type string eg
longstring  so can only be considered if we do that.  Usefull in cases where
you don't have/want a heap

As I have said elsewhere, I strongly oppose a LongString/ShortString
distinction. It cannot be used correctly.

.         Strand use ,  benefits for large strings and GC  , some concern
over small strings overhead and how it will work in practice eg aggregating
mixed strings how to  create strings efficiently  .  Risk is for small
string performance and possibly creating strings. 

I can think of several other options. The point is that all of these are a
matter that is purely internal to the runtime. It isn't part of the
specification how the string is internally implemented, and must not be.

shap 

No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 9.0.862 / Virus Database: 271.1.1/3183 - Release Date: 10/16/10
02:34:00

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode and bitc

Reply via email to