>Last night I thought so, but then I realized that a smarter encoding is possible. The strands are contiguous series of code units. The string object itself could be just a contiguous series of (offset, strand pointer) pairs. The code unit size of the strand is encoded in the pair somewhere. No actual tree is required. You bsearch the offsets and then index into the discovered strand. If you maintain a current offset marker within the string representation, you'll get O(1) most of the time
Why does the strand size need code size encoded ? Doesn't an array have a size field ? As an alternative you can make a strand as either of these 2 types ( all in UCS-1 , 2 or 4) - [Type bits] [size] [Data] [Next Strand] or [left Strand] [right Strand] [size][Data] or even Node or leaf - [Type bits] [size][Data] The first strand is embedded in the string object so for small strings - String is just a array - There is no pointer memory overhead - You can cast to array for interop ( with the check on type) - And importantly when creating small strings you only create a single object , For longer strings you have the NextStrand so you have a list ( or tree if desired) - You can walk the list via size(of) getting you to the index quickly. - Small strings don't incur large string costs - Large strings can get benefits ( including line walking , more complex index in the lib etc) - Large string design optimized for large string work with the cost moved down to the strand - Costs 1 bit to see strand is tree/list or simple array , maybe if 00 then array. One thing that bothers me with Stranded string is the strands appear to me to be the immutable strings..and string a larger structure that would be mutable to say work with lines , insertions etc. I know the reasons but it seems strange to have say such a flexible structure in the lib whether trees or linked lists yet be immutable. Maybe there is a case where they can be mutable or immutable upon construction but strands are always immutable. > As I have said elsewhere, I strongly oppose a LongString/ShortString distinction. It cannot be used correctly. I know you have stated this but while I have not thought about it much I do know C++ has Cstring and char[] supported etc and we may be able to do something clever with inheritance ( but instead of aVMT we just hook in if string Type A do this if B do that) >>UTF-8 with binary index ,O(1) lookups , Issues with backward compatible API / algorithms and developers needing to understand its bytes. Not great >>for long strings due to large memory object. >Not sure why this creates a compatibility issue, since the index is internal to the runtime layer's implementation. Because your exposing byte offset indexes to programmers who may expect character indexes. From: [email protected] [mailto:[email protected]] On Behalf Of Jonathan S. Shapiro Sent: Sunday, October 17, 2010 4:22 AM To: [email protected]; Discussions about the BitC language Subject: Re: [bitc-dev] Unicode and bitc On Sat, Oct 16, 2010 at 1:21 AM, Ben Kloosterman <[email protected]> wrote: 1) For adding strings isn't it an issue if you add say 1 UCS-1 then 1 UCS-4 then 1 UCS-1 due to the "tree overhead ? So you will need to parse and re-encode ? Do you create strings from chars as UCS-4 in a a string builder or will string builder change to UCS-4 when needed ? Last night I thought so, but then I realized that a smarter encoding is possible. The strands are contiguous series of code units. The string object itself could be just a contiguous series of (offset, strand pointer) pairs. The code unit size of the strand is encoded in the pair somewhere. No actual tree is required. You bsearch the offsets and then index into the discovered strand. If you maintain a current offset marker within the string representation, you'll get O(1) most of the time. 2) Let me confirm that this ,each strand is a separate heap object right ? It lives in the heap, but it is not a first-class object. . UTF-8 with char indexes , O(n) scans so not suitable for medium to long strings. Risk on Medium to long stream performance , Unless you keep an associated side table of (offset, run-size) pairs. . UTF-8 with binary index ,O(1) lookups , Issues with backward compatible API / algorithms and developers needing to understand its bytes. Not great for long strings due to large memory object. Not sure why this creates a compatibility issue, since the index is internal to the runtime layer's implementation. Risk on backward compatibility and developer incorrect usage o Use operator overloading on + , will require a typedef of the index type and could remove some of the compatibility issues. o value type option . Will still require a reference type string eg longstring so can only be considered if we do that. Usefull in cases where you don't have/want a heap As I have said elsewhere, I strongly oppose a LongString/ShortString distinction. It cannot be used correctly. . Strand use , benefits for large strings and GC , some concern over small strings overhead and how it will work in practice eg aggregating mixed strings how to create strings efficiently . Risk is for small string performance and possibly creating strings. I can think of several other options. The point is that all of these are a matter that is purely internal to the runtime. It isn't part of the specification how the string is internally implemented, and must not be. shap No virus found in this incoming message. Checked by AVG - www.avg.com Version: 9.0.862 / Virus Database: 271.1.1/3183 - Release Date: 10/16/10 02:34:00
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
