I had not looked at the core driver, thanks.

It seems like a useful thing to be aware of in general, but I don't think it
helps in my case.  It sounds like it is useful mainly for writing, i.e. for
building up an HDF5 file in memory.

But if you have a big HDF5 on disk, I don't see how the core driver helps
you access it.  You could copy the whole thing into an in-memory file, but we
don't want a big startup hit like that.  Maybe I am missing a way to use
the core driver here.
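
For anyone who finds this thread later, this is roughly the usage I have in
mind (just a sketch; the increment, read-only mode, and function name are my
own placeholders).  The whole file gets read into RAM when H5Fopen is called,
which is exactly the startup hit I want to avoid:

    #include "hdf5.h"

    /* Sketch only: open an existing file entirely in memory via the core
       driver.  The full file contents are read into RAM at open time. */
    static hid_t open_in_core(const char *path)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* 64 MB growth increment; backing_store = 0 means nothing is
           written back to disk when the file is closed. */
        H5Pset_fapl_core(fapl, 64 * 1024 * 1024, 0);

        hid_t file = H5Fopen(path, H5F_ACC_RDONLY, fapl);
        H5Pclose(fapl);
        return file;  /* H5Dread/H5Dwrite as usual; data is served from RAM */
    }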

-Philip



On Mon, Dec 6, 2010 at 10:08 PM, Quincey Koziol <[email protected]> wrote:

> Hi Philip,
> Have you considered using the 'core' file driver (H5Pset_fapl_core)?
>
> Quincey
>
> On Dec 6, 2010, at 6:52 PM, Philip Winston wrote:
>
>> I mean the code I gave you mmaps the file as a
>> whole, not individual datasets in the file. But, it nonetheless mmaps
>> UNDERNEATH the explicit reads/writes (e.g. H5Dread/H5Dwrite calls) made
>> by the application. So, I am thinking this is nowhere near the paradigm
>> you were hoping for.
>>
>
> I was hoping for a true mmap model.  But now I see that is perhaps
> impossible.  mmap only works if what is in memory is identical to what's
> on disk; for HDF5, endianness alone can break that assumption, right?
> Plus lots of other things, like chunked datasets.
>
> So for my situation one option is to keep HDF5 around for interchange,
> but at runtime to "optimize" to a simple binary format where I can mmap
> the entire dataset.  Then I can just read/write anywhere and the OS takes
> care of everything.
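>
> Something like this is what I am picturing for the flat format (just a
> sketch; the names are made up, and it assumes the file is nothing but a
> packed array of native-endian int32 values):
>
>     #include <fcntl.h>
>     #include <stdint.h>
>     #include <sys/mman.h>
>     #include <sys/stat.h>
>     #include <unistd.h>
>
>     /* Map the whole flat file; the mapping itself is the dataset, and
>        the OS pages data in and out as we touch it. */
>     static const int32_t *map_values(const char *path, size_t *n_values)
>     {
>         int fd = open(path, O_RDONLY);
>         if (fd < 0)
>             return NULL;
>
>         struct stat st;
>         fstat(fd, &st);
>
>         void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
>         close(fd);  /* the mapping stays valid after close */
>         if (p == MAP_FAILED)
>             return NULL;
>
>         *n_values = (size_t)st.st_size / sizeof(int32_t);
>         return (const int32_t *)p;
>     }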
>
> It's tempting: coming from a situation where everything is in RAM today,
> continuing to access things randomly and letting the OS figure it out
> seems like the least work.  But I don't know how smart that is.  Maybe it
> is kind of a red herring: it would work, but it would perform horribly.
> Maybe, coming from a situation where everything is in RAM, we have to
> rethink things a lot to make them work off disk, and organize the data
> for coherence so we can read big chunks instead of single rows.
>
>> My experience is that for simple queries (give me this hyperslab of
>> data), products like HDF5 are going to give better I/O performance than
>> some RDBMS. But, if you are really talking about highly sophisticated
>> queries where future reads/writes depend upon other parts of the query
>> and the datasets being queried, that sounds more like an RDBMS than an
>> I/O library sort of thing. Just my two cents. Good luck.
>>
>
> Our data is essentially a tabular representation of a tree.  Every row is
> a node in the tree.  There are 2-10 values in a row, but tens of millions
> of rows.  So in a sense our queries do depend on values as we read them,
> because, for example, we'll read a value, find the children of a node,
> read those values, and so on.
>
> I imagine HDF5 is at its best reading large amounts of data at a time.
> We would generally be reading 1 row at a time: set up one hyperslab,
> tiny read, new hyperslab, tiny read.
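>
> Concretely each access would be something like the sketch below (it
> assumes hdf5.h is included; the function and variable names are made up),
> which is a lot of HDF5 machinery for a 2-10 value read:
>
>     /* Read one row of a 2-D integer dataset.  dset is an open dataset
>        handle, ncols is the row width, buf holds at least ncols ints. */
>     static herr_t read_one_row(hid_t dset, hsize_t row, hsize_t ncols,
>                                int *buf)
>     {
>         hsize_t start[2] = { row, 0 };
>         hsize_t count[2] = { 1, ncols };
>
>         /* Select the single row in the file, describe a matching
>            1-D buffer in memory, then do the tiny read. */
>         hid_t fspace = H5Dget_space(dset);
>         H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
>         hid_t  mspace = H5Screate_simple(1, &ncols, NULL);
>         herr_t status = H5Dread(dset, H5T_NATIVE_INT, mspace, fspace,
>                                 H5P_DEFAULT, buf);
>
>         H5Sclose(mspace);
>         H5Sclose(fspace);
>         return status;
>     }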
>
> We have other uses in mind for HDF5, but for this particular type of
> data I wonder if maybe it's just not a good fit.
>
> -Philip
>
>
>
>> > On Mon, Dec 6, 2010 at 3:21 PM, Mark Miller <[email protected]> wrote:
>> >
>> > I am not sure if you got an answer to this email and so I thought I
>> > would pipe up.
>> >
>> > Yes, you can do mmap if you'd like. I took HDF5's sec2 Virtual File
>> > Driver (VFD) and tweaked it to use mmap instead, just to test how
>> > something like this would work. I've attached the (hacked) code. To
>> > use it, you are going to have to learn a bit about HDF5 VFDs. Learn
>> > about them in File Access Property Lists,
>> > http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html, as well as
>> > http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html
>> >
>> > It is something to start with. I don't know if HDF5 has plans for
>> > writing an mmap-based VFD, but they really ought to; it is something
>> > that is definitely lacking from their supported VFDs currently.
>> >
>> > Mark
>> >
>> > On Fri, 2010-12-03 at 17:02, Philip Winston wrote:
>> > > We just added HDF5 support in our application.  We are using the C
>> > > API.  Our datasets are 1D and 2D arrays of integers, a pretty simple
>> > > structure on disk.  Today we have about 5GB of data and we load the
>> > > whole thing into RAM, do somewhat random reads, make changes, then
>> > > overwrite the old .h5 file.
>> > >
>> > > I only learned a very minimal amount of the HDF5 API to accomplish
>> > > the above, and it was pretty easy.  Now we are looking at supporting
>> > > much larger datasets, such that it will no longer be practical to
>> > > have the whole thing in memory.  This is where I'm confused about
>> > > exactly what HDF5 offers vs. what is up to the application, and
>> > > about what's the best way to do things in the application.
>> > >
>> > > Ideally, in my mind, what I want is an mmap-like interface: just a
>> > > raw pointer which "magically" pages stuff off disk in response to
>> > > reads, and writes stuff back to disk in response to writes.  Does
>> > > HDF5 have something like this, or can/do people end up writing
>> > > something like this on top of HDF5?  Today our datasets are
>> > > contiguous and I am assuming we'd want chunked datasets instead,
>> > > but it's not clear to me how much "paging" functionality chunking
>> > > buys you and how much you have to implement.
>> > >
>> > > Thanks for any ideas or pointers.
>> > >
>> > > -Philip
>> >
>> > --
>> > Mark C. Miller, Lawrence Livermore National Laboratory
>> > ================!!LLNL BUSINESS ONLY!!================
>> > [email protected]      urgent: [email protected]
>> > T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511
>> >
>> >
>> --
>> Mark C. Miller, Lawrence Livermore National Laboratory
>> ================!!LLNL BUSINESS ONLY!!================
>> [email protected]      urgent: [email protected]
>> T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511
>>
>>
>>
>
>
>
>
>
>
