I have put my current rather lame trie class here:

http://felix.sf.net/trie.cpp

if anyone wants to give it a fling. The trie maps strings
in the form of a pair char*, int to pointers .. in the test
the pointer is the string itself. My test file is

/usr/share/dict/words

which contains a lot of words with Latin-1 encoding (yuk),
you may need to edit the file to change the filename.

Current stats are:

File contains 98325 entries
Trie contains 98325 entries
Total key lengths 831277 bytes
Average key length 8 bytes
Node count 225302 nodes
Memory 11039789 bytes
Cost 112 bytes per key
Leaf count 65219 keys (which are not prefixes of others)

Recall a trie is a digital tree .. basically it's a page
table. In principle, each node contains an array of 256
pointers to child nodes, one for each character. Search
for a key length n then requires n array indexes.

This implementation uses a brain dead compression technique:
if a node has children, two arrays contain the indexes,
sorted, of the children, and pointers to the child nodes.
If a node is added or removed the arrays are realloc()'ed
to the right size. If there are no children, a NULL pointer
is used for both arrays. A counter keeps track of how
many children there are:

class node {
  void *data;
  node *parent;
  kidrep kids;
  ...
};

class kidrep {
  node **kids;
  unsigned char *indexes;
  int nindexes;
  ...
};

Once I'm a bit more sure it all works, I will change the encoding
so if nindexes 1 or 2, kids and indexes will actually contain
the node pointers (and the high bits of nindexes will contain
the actual indexes).

Another improvement, which will make a mess of the iterators,
if to store the value of a leaf key (one which is not a prefix
of any other key) directly in the kids array, sitting the low
bit to 1 to flag it as such.. this will prevent the data being
an odd address however.

An interesting thing about trie is that purely functional
operations are cheap. For example adding a key of length n
requires copying n nodes to make a new trie, then modifying
that -- the rest of the trie can be shared.

Various specialisation such as fixing the keylength to
sizeof(void*) so the keys are pointers, and changing
the data type to a bool (use a bitmap for the bottom
level of the tree) should result in significant performance 
gains.

I'm hoping it is possible to get the bytes per key, where
the key is a heap address, down to sizeof(void*) * 2.x
for some small x. This would allow mapping allocated
pointer to shapes, and also supply the garbage collector
with a low overhead way to scan the heap. 

Judy (that is, Judy arrays) claim a factor of about 1.3 for 
to maintain a set of addresses .. which is better than the
current collector which has a 6 word overhead (48 bytes
on amd64) per heap object.

one downside .. because of the encoding, this data structure
will prevent use of the Boehm collector: if the keys are
pointers they're stored as split up sequence of bytes.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Felix-language mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/felix-language

Reply via email to