> 
> Our data is essentially a tabular representation of a tree.  Every row
> is a node in the tree.  There are 2-10 values in a row, but tens of
> millions of rows.  So in a sense our queries do depend on values as we
> read them: for example, we'll read a value, find the children of a
> node, read those values, and so on.
> 
> I imagine HDF5 is best at reading large amounts of data at a time.
> We would generally be reading one row at a time: set up one
> hyperslab, tiny read, new hyperslab, tiny read.
> 
> We have other uses in mind for HDF5, but for this particular type of
> data I wonder whether it's just not a good fit.
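
For concreteness, the row-at-a-time pattern you describe looks roughly
like this (the dataset handle, row width, and element type here are
hypothetical stand-ins):

    #include <hdf5.h>

    void read_row(hid_t dset, hsize_t row, hsize_t ncols, double *buf)
    {
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[2] = {row, 0};
        hsize_t count[2] = {1, ncols};
        // Set up one hyperslab, do one tiny read; repeat per node visited.
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr,
                            count, nullptr);
        hid_t mspace = H5Screate_simple(2, count, nullptr);
        H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace);
        H5Sclose(fspace);
    }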

So, others may have a different opinion on this.

But, in my experience, if you don't mind tuning the I/O operations
yourself, by designing appropriate HDF5 persistent structures (i.e. the
layout of the HDF5 file) together with appropriate algorithms for
handling I/O between your memory-resident data and the data as it is
stored in the file, then a product like HDF5 gives you pretty much all
the knobs you could ever want to achieve good performance.
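
For example, two of the knobs that matter most for a workload like this
are the dataset's chunk shape and the per-dataset chunk cache. A sketch,
with illustrative sizes that are guesses rather than recommendations:

    #include <hdf5.h>

    int main(void)
    {
        hid_t file = H5Fcreate("tree.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);

        // 1-D dataset, one row per tree node, growable.
        hsize_t dims = 0, maxdims = H5S_UNLIMITED;
        hid_t space = H5Screate_simple(1, &dims, &maxdims);

        // Knob 1: chunking -- group many small rows into each unit of I/O.
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        hsize_t chunk = 4096;  // rows per chunk
        H5Pset_chunk(dcpl, 1, &chunk);

        // Knob 2: chunk cache -- keep recently touched chunks in memory
        // so repeated tiny reads hit RAM instead of disk.
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 12421, 64*1024*1024, 0.75);

        hid_t dset = H5Dcreate2(file, "nodes", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, dapl);

        H5Dclose(dset); H5Pclose(dapl); H5Pclose(dcpl);
        H5Sclose(space); H5Fclose(file);
        return 0;
    }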

If, on the other hand, you are hoping to just take advantage of some
existing I/O-optimizing solution and you DO NOT want to have to
explicitly manage and think about it, then you might want to consider
another product, though I honestly don't know how many products can do
a good job of hiding I/O performance issues from you, across a wide
range of access patterns, without also consuming a slew of memory.

That said, I have worked on just the kind of problem you describe: a
large tree where the nodes in the tree are maybe a few kilobytes in
size. I store the data in an HDF5 file with nodes grouped into larger
chunks. But I like to traverse the tree as though it is entirely in
memory, without having to think about it. My tree nodes are 'smart'
enough that when I traverse a pointer to a non-memory-resident node, the
I/O to HDF5 happens automagically. A single read brings many, many
nodes into memory, but they are located 'near' each other in the tree,
so the cost of the read is often well amortized over all the nearby
nodes I wind up traversing anyway. It's relatively easy to code up
something like this. Performance was reasonable for the application I
was working on at the time, but I could also easily predict the common
traversals I'd need.
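
A minimal sketch of the idea, assuming a 1-D dataset of fixed-size node
records, one per row; the record layout, CHUNK_NODES, and the NodeStore
name are made up for illustration, not code from my application:

    #include <hdf5.h>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Node {                  // fixed-size on-disk record
        std::int64_t child[8];     // row indices of children (-1 = none)
        double       value[2];     // per-node payload
    };

    class NodeStore {
    public:
        explicit NodeStore(hid_t dset) : dset_(dset) {}

        // Dereference a node by row index; fault in its chunk on a miss.
        const Node &get(hsize_t row)
        {
            hsize_t chunk = row / CHUNK_NODES;
            auto it = cache_.find(chunk);
            if (it == cache_.end())
                it = cache_.emplace(chunk, load_chunk(chunk)).first;
            return it->second[row % CHUNK_NODES];
        }

    private:
        static constexpr hsize_t CHUNK_NODES = 4096; // match dataset chunking

        // One hyperslab read pulls CHUNK_NODES neighboring nodes into
        // memory at once.  (A real version would clamp the last, partial
        // chunk and evict cold chunks instead of caching forever.)
        std::vector<Node> load_chunk(hsize_t chunk)
        {
            std::vector<Node> buf(CHUNK_NODES);
            hsize_t start = chunk * CHUNK_NODES, count = CHUNK_NODES;
            hid_t fspace = H5Dget_space(dset_);
            H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, nullptr,
                                &count, nullptr);
            hid_t mspace = H5Screate_simple(1, &count, nullptr);
            H5Dread(dset_, node_type(), mspace, fspace, H5P_DEFAULT,
                    buf.data());
            H5Sclose(mspace);
            H5Sclose(fspace);
            return buf;
        }

        // HDF5 compound type matching struct Node.
        static hid_t node_type()
        {
            static hid_t t = -1;
            if (t < 0) {
                hsize_t nchild = 8, nval = 2;
                hid_t child_t = H5Tarray_create2(H5T_NATIVE_INT64, 1, &nchild);
                hid_t value_t = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, &nval);
                t = H5Tcreate(H5T_COMPOUND, sizeof(Node));
                H5Tinsert(t, "child", HOFFSET(Node, child), child_t);
                H5Tinsert(t, "value", HOFFSET(Node, value), value_t);
                H5Tclose(child_t);
                H5Tclose(value_t);
            }
            return t;
        }

        hid_t dset_;
        std::unordered_map<hsize_t, std::vector<Node>> cache_;
    };

Traversal code just calls get() on child indices and never sees the I/O;
tuning then amounts to picking CHUNK_NODES so that nodes near each other
in the tree land in the same chunk on disk.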

Mark

-- 
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
[email protected]      urgent: [email protected]
T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511

