On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
Another OT thread to pick your brains. :)
What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
I thought about B-trees, but they have high overhead (and are a pain to implement), and they only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, the accesses do exhibit good spatial locality in that higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?
T
A burst trie might work, although it was originally designed for text. I haven't had a chance to try one out in code yet, but the paper by Heinz, Zobel, and Williams is interesting:

Burst Tries: A Fast, Efficient Data Structure for String Keys (2002)
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499
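
To make the idea a bit more concrete, here is a rough in-memory sketch in D of just the bursting rule. The node layout, names, and the tiny burst threshold are mine for illustration, not from the paper, and values/end-of-key handling are omitted:

// Sketch only: suffixes sit in small unsorted buckets ("containers")
// until a bucket grows past burstLimit, at which point it is "burst"
// into a proper trie node. The threshold is deliberately tiny here.
enum size_t burstLimit = 4;

final class Node
{
    Node[256] children;        // trie children for bytes whose bucket has burst
    string[][256] containers;  // unsorted buckets of key suffixes, not yet burst

    void insert(string key)
    {
        if (key.length == 0)
            return; // a real implementation would record the value here
        immutable size_t c = key[0];
        if (children[c] !is null)
        {
            children[c].insert(key[1 .. $]);
            return;
        }
        containers[c] ~= key[1 .. $];
        if (containers[c].length > burstLimit)
            burst(c); // bucket got too big: turn it into a trie node
    }

    private void burst(size_t c)
    {
        auto node = new Node;
        foreach (suffix; containers[c])
            node.insert(suffix);
        containers[c] = null;
        children[c] = node;
    }
}

void main()
{
    import std.stdio : writeln;
    auto root = new Node;
    foreach (word; ["banana", "band", "bandit", "banner", "bank", "bar"])
        root.insert(word);
    writeln("bucket for 'b' burst into a node: ", root.children['b'] !is null);
}

For your on-disk case the interesting part would be mapping each node or container to a page and picking the burst threshold to match the page size, which this sketch doesn't attempt.
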
Joseph