>2010/10/15 Ben Kloosterman <[email protected]>:
 >> The main cons I see is besides the tree index/reference cost , each
 >> substring would need a field (which may be aligned to 4-8 bytes) or
 >> char to indicate the encoding and the higher initial / final parse
 >> overhead.
 >
 >I think shap imagines that there are different types for leaf nodes
 >with different encodings, so the encoding is determined by the type/gc
 >tag. So a string with one encoding type would appear in memory as

This is really saying we have strands (which are themselves strings) of a
certain encoding within a big string, so it is mainly an abstraction wrapper.
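
To make the "strands inside a wrapper" reading concrete, here is a rough C++
sketch. Everything in it is an assumption for illustration: the names
Encoding, Strand and StrandedString are made up, a flat vector stands in for
the tree, and the strands are fixed-width UCS-1/2/4. It is not shap's design
or anything that exists in BitC.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical encodings; the enum value is the width of one code unit in bytes.
enum class Encoding { Ucs1 = 1, Ucs2 = 2, Ucs4 = 4 };

// One strand: a run of characters that all share a single fixed-width encoding.
// In the typed-leaf scheme the encoding would live in the type/GC tag instead
// of a field; a field is used here only to keep the sketch short.
struct Strand {
    Encoding enc;
    std::string bytes;  // raw code units, each width() bytes wide

    std::size_t width() const { return static_cast<std::size_t>(enc); }

    // Number of characters in this strand.
    std::size_t length() const {
        assert(bytes.size() % width() == 0);
        return bytes.size() / width();
    }
};

// The "big string" is then little more than a wrapper over its strands.
struct StrandedString {
    std::vector<Strand> strands;  // assumption: flat list instead of a tree

    void append(Strand s) { strands.push_back(std::move(s)); }

    // Total characters across all strands, whatever their encodings.
    std::size_t length() const {
        std::size_t n = 0;
        for (const Strand& s : strands) n += s.length();
        return n;
    }
};
```

Appending a three-character UCS-1 strand and a two-character UCS-4 strand
gives a string of length five held in two strands, which is the wrapper
behaviour I mean.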

I think separating small strings from big strings may be better:

1. A tree with separate types will incur quite a large cost: even two empty
references take 16 bytes (with 8-byte pointers), which is a bit much for an
empty string, especially considering string arrays initialized to empty
strings. Doing a cout << English chars << Chinese char << English char etc.
would be problematic: while it appears simple, in practice you would need to
parse and convert them all to, say, UCS-4 after having them as UCS-1 and
UCS-4, else the tree becomes too big.

2. The problems we are trying to solve (GC, O(N)) apply only to large
strings, so why pay the price for frequently used small strings? A
horses-for-courses approach may fit better, and the big string can solve a
number of other problems.

3. Small strings can be placed in the tree, e.g. a UTF-8 small string can
simply be placed inside as a node, and the big string uses byte indexing
(all hidden internally).
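
A minimal C++ sketch of the split, to show roughly what I have in mind. The
names SmallString and BigString, the flat leaf vector in place of a real
tree, and the binary search over leaf start offsets are all my assumptions
for illustration, not an agreed design.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical small string: flat UTF-8 bytes, no tree and no child references.
struct SmallString {
    std::string utf8;
};

// Hypothetical big string: small strings are used directly as leaf nodes and
// positions are byte offsets, so no leaf ever needs re-encoding.
class BigString {
public:
    void append(SmallString leaf) {
        starts_.push_back(total_bytes_);  // byte offset where this leaf begins
        total_bytes_ += leaf.utf8.size();
        leaves_.push_back(std::move(leaf));
    }

    std::size_t byte_length() const { return total_bytes_; }
    std::size_t leaf_count() const { return leaves_.size(); }

    // Byte indexing: binary-search the leaf start offsets, then index into
    // that leaf. Callers never see the leaves.
    unsigned char byte_at(std::size_t offset) const {
        assert(offset < total_bytes_);
        // First leaf starting after the offset; the one before it holds the byte.
        auto it = std::upper_bound(starts_.begin(), starts_.end(), offset);
        std::size_t i = static_cast<std::size_t>(it - starts_.begin()) - 1;
        return static_cast<unsigned char>(leaves_[i].utf8[offset - starts_[i]]);
    }

private:
    std::vector<SmallString> leaves_;
    std::vector<std::size_t> starts_;
    std::size_t total_bytes_ = 0;
};
```

The point of the sketch is that an ordinary small string pays for nothing
beyond its bytes, and only code that really builds a large string pays for
the index.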



Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
