On Sat, 20 Mar 2010 16:55:15 -0300, Robert Jacques wrote:

> On Wed, 17 Mar 2010 02:57:34 -0300, Michael Rynn
> <[email protected]> wrote:
> [snip]
>> I want to add that the problems of AA with garbage collection are not
>> particular to the implementation of built-in AA. The number of nodes
>> that builtin AA creates will increase GC scanning, but the AA slowdowns
>> also occurs using the pyDict implementation with string keys.
>
> Actually, you can design the AA to 'play nice' with D's GC. See RandAA.
I have done a lot of work comparing and testing the various AA implementations with uint and string keys. The results are at http://www.dsource.org/projects/aa/wiki/HashTreeBenchmark, and the adapted sources and the benchmark program are in the http://www.dsource.org/projects/aa source tree.

I included two variants of the RandomAA in the test results: one using the current pseudo-random probe sequence, and the other using a slot sequence algorithm similar to the pydict implementation. I called the second variant the parallel-py AA (still in the RandAA file, as the default template option). The parallel-rd variant, using pseudo-random lookup, was a clear performance loser, possibly because the keys are already well randomized. Parallel-py outperformed pydict on uint keys, but lagged on string keys. So the idea of using separate arrays to hide data from unnecessary GC scans may be a help.

The best performer all round was the Tango-based hashmap. The builtin AA has a few issues with really big datasets, as noted before, and was not so fast with insertions and integer-sized keys. The big-dataset GC problem is hopefully, and probably, not an issue for smaller datasets (< 50,000 entries), unless the data happens to have a lot of unlucky significant pointers, something very hard to control.

There is now a template code version of the builtin AA class (hash/hashtree.d) with added optimizations: node heap management (from the Tango Container module), no hashing for integer-sized keys, and enforceable cleanup. On these benchmarks this optimized hashtree performed as well as the Tango hashmap. I also added some enforceable cleanup to the Tango-based hashmap, so that GC failure issues did not affect timings on big datasets.

---
taf
Michael Rynn
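P.S. For anyone wanting the flavour of the "separate arrays" idea, here is a minimal sketch (my own illustration, not RandAA's actual code). A node-based table mixes hashes, keys and values in one allocation, so every node gets scanned; splitting the table into parallel arrays lets any array that holds no pointers be allocated with NO_SCAN, so the collector never walks it.

```d
import core.memory : GC;

// Hypothetical layout: three parallel arrays instead of one node per entry.
struct ParallelTable(K, V)
{
    size_t[] hashes; // pure integer data: safe to hide from the GC
    K[]      keys;
    V[]      vals;

    void reserve(size_t n)
    {
        // The hash array never holds pointers, so ask the GC not to scan it.
        auto p = cast(size_t*) GC.malloc(n * size_t.sizeof, GC.BlkAttr.NO_SCAN);
        hashes = p[0 .. n];
        // D's runtime already marks these NO_SCAN when K or V are pointer-free.
        keys.length = n;
        vals.length = n;
    }
}
```

With uint keys and string values, only the vals array is left for the GC to scan, which is the kind of saving the benchmark numbers suggest.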
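P.P.S. The "no hash for integer-sized keys" optimization amounts to a compile-time specialization along these lines (again a sketch of the technique, not the hashtree.d source): when the key already fits in a machine word, use it directly as its own hash and skip the hash call entirely.

```d
// Sketch: choose the hashing strategy at compile time from the key type.
size_t keyHash(K)(auto ref K key)
{
    static if (__traits(isIntegral, K) && K.sizeof <= size_t.sizeof)
        return cast(size_t) key;        // the key is its own hash
    else
        return typeid(K).getHash(&key); // generic runtime hash for everything else
}
```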
