[Jprogramming] burstsort - using Burst Tries: fastest sort?

j . language Thu, 02 Aug 2007 16:10:39 -0700

Does anybody have 'j code' implementing this?  The
discription found at:
 
http://citeseer.ist.psu.edu/cachedpage/526368/1


indicates to me to me it could eligantly be emplimented in
J.

If you were to turn it into an .exe file for dos/windows or
an executable for unix ???
 
ABSTRACT
 
Many applications depend on efficient management of large
sets of distinct strings in  memory.
For example, during index construction for text databases a
record is held for each distinct word
in the text, containing the word itself and information such
as counters.  We Purpose a new
data structure, the burst trie, that has significant
advantages over existing options for such
applications:  it requires no more memory than a binary
tree; it is as fast as a trie; and, while
not as fast as a hash table, a burst trie maintains the
strings in sorted or near sorted order.
In this paper we describe burst tries and explore the
parameters that govern their performance.
We experimentally determine good choices of parameters, and
compare burst tries to other
structures used for the same task, with a variety of data
sets.  The experiments show that
the burst trie is particularly effective for the skewed
frequency distributions common in text
collections, and dramatically outperforms all other data
structures for the task of managing
strings while maintaining sort order.
 
Conclusion:
We have described the burst trie, a variant form of the trie
that is highly efficient for managing
large sets of strings in memory.  In the burst trie, each
leaf holds a set of strings in a container,
allowing dramatic space reductions with no impact on
efficiency.  We have explored a wide range
of options for managing a burst trie, including different
container structures, different heuristics
for deciding when to burst, and different techniques for
bursting.  In large scale experiments
with a variety of real data sets, we have shown that burst
tries are faster than compact tries
and use 1/6 of the space:  are more compact than binary
trees or splay trees and are
over two times faster: and are close to hash tables in
efficiency, yet keep the data in sort order.
 
These results demonstrate that the burst trie is
dramatically more efficient than any previous
structure for the task of managing sorted strings.  The
improvements are the same on data sets
ranging from 100 Mb to 45 Gb in size.  For text data, these
gains are because the more common
strings are typically also the shorter strings, and are
stored wholly within the access trie.  Rare
strings held in containers, but because they are typically
long there is no loss of efficiency:
the cost of searching the container is offset by the savings
of not having to traverse a large number
of trie nodes.  Together, these savings allow in memory
processing of larger sets of strings than
was previously possible, in much less time.
 
the performance of burst trie depends on the distribution of
strings.  The worst case
we observed was genomic data, where the strings are of equal
length and have a relatively
flat probability distribution.  Even in this case,
performance was comparable to the other tree
structures.  Analytical results have shown that the height
of a burst trie is expected to be
logarithmic.  We therefore believe that the burst trie is
the data structure of choice for
practical string management.
 
 
 
 
 
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

[Jprogramming] burstsort - using Burst Tries: fastest sort?

Reply via email to