>
> I mean the code I gave you mmaps the file as a
> whole, not individual datasets in the file. But, it nonetheless mmaps
> UNDERNEATH the explicit reads/writes (e.g. H5Dread/H5Dwrite calls) made
> by the application. So, I am thinking this is nowhere near the paradigm
> you were hoping for.
>

I was hoping for a true mmap model, but now I see that is perhaps
impossible.  mmap only works if what is in memory is byte-for-byte
identical to what's on disk, and for HDF5 endianness alone can break that
assumption, right?  Plus lots of other things, like chunked datasets.
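
For example, if the file were written big-endian and we're on a
little-endian machine, every value in the mapping would need a swap on the
way in and out, which defeats the whole point of a pointer straight into
the file.  Rough illustration only, not code from our app:

    #include <stdint.h>

    /* If the bytes on disk are big-endian but the machine is
     * little-endian, nothing in the mapping is usable in place;
     * every value has to be swapped on both reads and writes. */
    static inline uint32_t swap32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u)
             | ((v << 8) & 0x00FF0000u) | (v << 24);
    }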

So for my situation one option is to keep HDF5 around for interchange, but
for runtime "optimize" to a simple binary format where I can mmap the
entire dataset.  Then I can just read/write anywhere and the OS takes care
of everything.
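
Something like this is what I have in mind -- just a sketch with made-up
names, assuming a flat file of native-endian int32 values:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a flat file of native-endian int32 values read/write.
     * After this, table[i] reads or writes value i and the OS
     * handles all of the paging. */
    int32_t *map_table(const char *path, size_t *nvals)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }

        void *p = mmap(NULL, (size_t)st.st_size,
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                    /* the mapping survives the close */
        if (p == MAP_FAILED) return NULL;

        *nvals = (size_t)st.st_size / sizeof(int32_t);
        return (int32_t *)p;
    }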

It's tempting because, coming from a situation where everything is in RAM
today, it seems like the least work to keep accessing things randomly and
let the OS figure it out.  But I don't know how smart that is.  Maybe it's
kind of a red herring: it would work, but perform horribly.  Maybe coming
from an everything-in-RAM world we have to rethink things a lot to make it
work off disk, organizing the data for coherence so we can read big chunks
instead of single rows.
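
One knob I know the OS does give you is madvise(), to at least say whether
the access will be random or a big sequential pass -- though I have no idea
how much it really helps in practice.  Sketch, assuming p/len come from the
mmap above:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hint the kernel about how we intend to touch the mapping. */
    void hint_access(void *p, size_t len, int mostly_random)
    {
        /* MADV_RANDOM: skip read-ahead, we jump around row by row.
         * MADV_SEQUENTIAL: aggressive read-ahead for a full scan. */
        madvise(p, len, mostly_random ? MADV_RANDOM : MADV_SEQUENTIAL);
    }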

> My experience is that for simple queries (give me this hyperslab of
> data), products like HDF5 are going to give better I/O performance than
> some RDBMS. But, if you are really talking about highly sophisticated
> queries where future reads/writes depend upon other parts of the query
> and the datasets being queried, that sounds more like an RDBMS than an
> I/O library sort of thing. Just my two cents. Good luck.
>

Our data is essentially a tabular representation of a tree.  Every row is
a node in the tree.  There are 2-10 values in a row, but tens of millions
of rows.  So in a sense our queries do depend on values as we read them,
because, for example, we'll read a value, find the children of that node,
read those values, and so on.
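
To be concrete, the access pattern is roughly like this (the field names
are made up for illustration, not our actual schema):

    #include <stdint.h>

    /* One row per tree node; children are reached by following row
     * indices stored in the row itself, so each hop lands on an
     * essentially random row -- i.e. a random page if the table is
     * mmapped. */
    typedef struct {
        int32_t value;
        int32_t first_child;    /* row index of first child, -1 if leaf  */
        int32_t next_sibling;   /* row index of next sibling, -1 if last */
    } node_row;

    static void visit_children(const node_row *table, int32_t n,
                               void (*visit)(const node_row *))
    {
        for (int32_t c = table[n].first_child; c >= 0;
             c = table[c].next_sibling)
            visit(&table[c]);
    }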

I imagine HDF5 is at its best when you read large amounts of data each
time.  We would generally always be reading one row at a time: set up one
hyperslab, tiny read, new hyperslab, tiny read.
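
In other words, per row we'd be doing roughly this (a sketch against a 2-D
int dataset; our real code differs in the details):

    #include "hdf5.h"

    /* Read a single row `r` (ncols values) of a 2-D integer dataset:
     * one hyperslab selection and one tiny H5Dread per row. */
    static herr_t read_one_row(hid_t dset, hsize_t r, hsize_t ncols, int *buf)
    {
        hsize_t start[2] = { r, 0 };
        hsize_t count[2] = { 1, ncols };
        hid_t   filespace = H5Dget_space(dset);
        hid_t   memspace  = H5Screate_simple(1, &count[1], NULL);

        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        herr_t status = H5Dread(dset, H5T_NATIVE_INT, memspace, filespace,
                                H5P_DEFAULT, buf);

        H5Sclose(memspace);
        H5Sclose(filespace);
        return status;
    }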

We have other uses in mind for HDF5, but for this particular type of data
I wonder if it's just not a good fit.

-Philip



> On Mon, Dec 6, 2010 at 3:21 PM, Mark Miller <[email protected]> wrote:
> > I am not sure if you got an answer to this email and so I thought I
> > would pipe up.
> >
> > Yes, you can do mmap if you'd like. I took HDF5's sec2 Virtual File
> > Driver (VFD) and tweaked it to use mmap instead, just to test how
> > something like this would work. I've attached the (hacked) code. To use
> > it, you are going to have to learn a bit about HDF5 VFDs. Learn about
> > them in File Access Property lists,
> > http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html, as well as
> > http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html
> >
> > It is something to start with. I don't know if HDF5 has plans for
> > writing an mmap-based VFD, but they really ought to, and it is
> > something that is definitely lacking from their supported VFDs
> > currently.
> >
> > Mark
> >
> > On Fri, 2010-12-03 at 17:02, Philip Winston wrote:
> > > We just added HDF5 support in our application.  We are using the C
> > > API.  Our datasets are 1D and 2D arrays of integers, a pretty simple
> > > structure on disk.  Today we have about 5GB of data and we load the
> > > whole thing into RAM, do somewhat random reads, make changes, then
> > > overwrite the old .h5 file.
> > >
> > > I only learned a very minimal amount of the HDF5 API to accomplish
> > > the above, and it was pretty easy.  Now we are looking at supporting
> > > much larger datasets, such that it will no longer be practical to
> > > have the whole thing in memory.  This is where I'm confused on
> > > exactly what HDF5 offers vs. what is up to the application, and on
> > > what's the best way to do things in the application.
> > >
> > > Ideally in my mind what I want is an mmap-like interface, just a raw
> > > pointer which "magically" pages stuff off disk in response to reads,
> > > and writes stuff back to disk in response to writes.  Does HDF5 have
> > > something like this, or can/do people end up writing something like
> > > this on top of HDF5?  Today our datasets are contiguous and I'm
> > > assuming we'd want chunked datasets instead, but it's not clear to
> > > me how much "paging" functionality chunking buys you and how much
> > > you have to implement.
> > >
> > > Thanks for any ideas or pointers.
> > >
> > > -Philip
> >
> --
> Mark C. Miller, Lawrence Livermore National Laboratory
> ================!!LLNL BUSINESS ONLY!!================
> [email protected]      urgent: [email protected]
> T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
