On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
This isn't directly related to D (though the code will be in
D), and I
thought this would be a good place to ask.
I'm trying to implement an algorithm that traverses a very
large graph,
and I need some kind of data structure to keep track of which
nodes have
been visited, that (1) allows reasonably fast lookups
(preferably O(1)),
and (2) doesn't require GB's of storage (i.e., some kind of
compression
would be nice).
The graph nodes can be represented in various ways, but
possibly the
most convenient representation is as n-dimensional vectors of
(relatively small) integers. Furthermore, graph edges are
always between
vectors that differ only by a single coordinate; so the edges
of the
graph may be thought of as a subset of the edges of an
n-dimensional
grid. The hashtable, therefore, needs to represent some
connected subset
of this grid in a space-efficient manner, that still allows
fast lookup
times.
The naïve approach of using an n-dimensional bit array is not
feasible
because n can be quite large (up to 100 or so), and the size of
the grid
itself can get up to about 10 in each direction, so we're
looking at a
potential maximum size of 10^100, clearly impractical to store
explicitly.
So, -10 to 10 in discrete steps. This means 5 bits per dimension
and 500 bits for a single coordinate. Is the graph distributed of
a compute cluster or does it fit into single computer? With a few
GB of RAM, this means your graph is quite sparse, yet nodes are
connected ("differ only by a single coordinate")?
Can you preprocess? I mean, walk all the nodes O(n) to compute a
good (perfect?) hash function.
In general, I think you should either store the flag right in the
graph node or mirror the graph structure.
I do not know any concrete algorithms.