>2010/10/15 Ben Kloosterman <[email protected]>:
>> The main cons I see is besides the tree index/reference cost , each
>> substring would need a field (which may be aligned to 4-8 bytes) or char to
>> indicate the encoding and the higher initial / final parse overhead.
>
>I think shap imagines that there are different types for leaf nodes
>with different encodings, so the encoding is determined by the type/gc
>tag. So a string with one encoding type would appear in memory as

This is really saying that we have strands (which are themselves strings) of
a particular encoding within a big string, so it is mainly an abstraction
wrapper. I think separating small strings from big strings may work better:

1. A tree with separate node types incurs quite a large cost. Even two empty
   references take 16 bytes, which is a lot for an empty string, especially
   when you consider string arrays initialized to empty strings. Something
   like cout << English chars << Chinese char << English chars would also be
   problematic. It looks simple, but in practice you would have to parse the
   UCS-1 and UCS-4 strands and convert them all to, say, UCS-4, or else the
   tree becomes too big.

2. The problems we are trying to solve (GC and O(N) costs) apply only to
   large strings, so why pay the price for frequently used small strings? A
   horses-for-courses approach may fit better, and the big string can solve
   a number of other problems.

3. Small strings can be placed in the tree directly: e.g. a UTF-8 small
   string can simply be inserted as a node, and the big string uses byte
   indexing (all hidden internally).

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
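
A minimal sketch of the small-string / big-string split from point 3, in Python purely for illustration. Everything here is an assumption rather than anything agreed in the thread: the names SmallString, BigString and concat, the 32-byte SMALL_LIMIT cutoff, and the use of a flat list of strands where a real implementation would use a tree. It only shows that a UTF-8 small string can be dropped in as a strand unchanged, and that byte indexing can stay hidden inside the big string.

```python
from dataclasses import dataclass
from typing import List, Union

SMALL_LIMIT = 32  # hypothetical cutoff in bytes; not a figure from the thread


@dataclass(frozen=True)
class SmallString:
    """Flat UTF-8 string: no tree and no per-strand references."""
    data: bytes  # UTF-8 encoded text

    def byte_len(self) -> int:
        return len(self.data)


@dataclass
class BigString:
    """Sequence of strands. A flat list stands in for the tree here."""
    strands: List[SmallString]

    def byte_len(self) -> int:
        return sum(s.byte_len() for s in self.strands)

    def byte_at(self, index: int) -> int:
        # Byte indexing, hidden from callers: skip whole strands until
        # the remaining index falls inside one of them.
        if index < 0:
            raise IndexError("byte index out of range")
        for s in self.strands:
            if index < s.byte_len():
                return s.data[index]
            index -= s.byte_len()
        raise IndexError("byte index out of range")

    def text(self) -> str:
        return b"".join(s.data for s in self.strands).decode("utf-8")


AnyString = Union[SmallString, BigString]


def concat(a: AnyString, b: AnyString) -> AnyString:
    """Stay flat while the result is small; otherwise keep the strands."""
    a_strands = a.strands if isinstance(a, BigString) else [a]
    b_strands = b.strands if isinstance(b, BigString) else [b]
    total = a.byte_len() + b.byte_len()
    if total <= SMALL_LIMIT:
        # Small result: pay for one copy, avoid any tree overhead.
        return SmallString(b"".join(s.data for s in a_strands + b_strands))
    # Large result: reuse the existing small strings as strands, no copy.
    return BigString(a_strands + b_strands)
```

With this split, concat(SmallString("abc".encode()), SmallString("中".encode())) stays a flat 6-byte SmallString, while appending to a 30-byte string crosses the assumed cutoff and yields a BigString whose strands are the original small strings.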
