On Mon, 2010-12-06 at 15:57, Philip Winston wrote:
> Thanks for the info and code!

You're welcome.

> Given this mmap VFD isn't yet part of the library, I'm wondering: does
> anyone do what we're talking about today with the existing HDF5
> library?

So, I guess I could be totally confused here. Unfortunately, my week is
so busy I won't have time to discuss/debate all the good questions
you've asked. Hopefully someone else might.

In the interim, I don't think you can avoid explicitly reading/writing
parts of your data. The code I gave you mmaps the file as a whole, not
individual datasets in the file. But it nonetheless mmaps UNDERNEATH
the explicit reads/writes (e.g. H5Dread/H5Dwrite calls) made by the
application. So I am thinking this is nowhere near the paradigm you
were hoping for.

You can do partial reads and writes WITHOUT resorting to chunked
datasets. You would need chunked datasets ONLY if you expect the size
of the dataset to vary over time and/or you are using various filters
on it during I/O (e.g. compression, checksumming). At the same time,
there may be no harm in chunking your dataset. I don't know whether
chunking the dataset and then optimizing your partial reads/writes
around the chunk structure would be 'better' than NOT chunking it and
just relying on HDF5's partial I/O capabilities on the unchunked
dataset.
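For the partial I/O, here is roughly what it looks like on a plain
contiguous dataset: you select a hyperslab in the file dataspace and
hand H5Dread/H5Dwrite a buffer shaped like the selection. A quick
sketch (untested; the file/dataset names, offsets, and sizes are all
made up):

#include "hdf5.h"

/* Read a 100x200 block at offset (1000,500) of a 2D integer dataset,
   change it in RAM, then write only that block back out. */
int main(void)
{
    hsize_t start[2] = {1000, 500};   /* where the block begins */
    hsize_t count[2] = {100, 200};    /* extent of the block    */
    static int buf[100][200];

    hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t dset = H5Dopen(file, "/mydata", H5P_DEFAULT);

    /* select the block in the file dataspace... */
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* ...describe the in-memory buffer the selection maps onto... */
    hid_t mspace = H5Screate_simple(2, count, NULL);

    /* ...and only the selected block moves between disk and RAM. */
    H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
    buf[0][0] += 1;   /* mutate in memory */
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset);   H5Fclose(file);
    return 0;
}

HDF5 does the offset arithmetic for you; you never pull in more than
the block you asked for, chunked or not.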
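And if you do decide to chunk, it is just a property at dataset
creation time; none of your read/write code above has to change. Again
a sketch with made-up sizes ('file' is an already-open file handle):

hsize_t dims[2]  = {50000, 50000};   /* full extent (made up)           */
hsize_t chunk[2] = {256, 256};       /* 256x256 ints = 256 KB per chunk */

hid_t space = H5Screate_simple(2, dims, NULL);
hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 2, chunk);        /* filters, if any, also go on dcpl */

hid_t dset  = H5Dcreate(file, "/mydata", H5T_NATIVE_INT, space,
                        H5P_DEFAULT, dcpl, H5P_DEFAULT);

If you expect the dataset to grow over time, you'd also pass a maxdims
argument to H5Screate_simple with H5S_UNLIMITED in the dimensions you
want extendible; that is one of the cases where chunking is required.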
> To summarize: we have a dataset that doesn't fit in memory. We want
> to "randomly" perform reads, reading only portions into RAM. Then we
> make changes in RAM. Then we want to write out only the changed
> portions.
>
> I'm guessing a chunked file is the starting point here, but what else
> is needed? Is there a layer on top to coordinate things? To hold a
> list of modified chunks?
>
> Is it even a good idea to attempt this usage model with HDF5? I read
> one person suggest that HDF5 is good for bulk read-only data but that
> he would use a database for "complex" data that requires changes. I
> wonder if our situation is just better suited to a database?
>
> Where do people draw the line? What do you consider an appropriate
> usage model for HDF5 vs. a database or something else? Thanks for any
> input. We have "adopted" HDF5 but really we don't understand it that
> well yet.

I think that depends on how complex your 'queries' are going to be and
how much query optimization could be exploited to improve I/O. My
experience is that for simple queries (give me this hyperslab of
data), products like HDF5 are going to give better I/O performance
than some RDBMS. But if you are really talking about highly
sophisticated queries, where future reads/writes depend upon other
parts of the query and the datasets being queried, that sounds more
like an RDBMS than an I/O library sort of thing.
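On your "layer on top to hold a list of modified chunks" question: for
chunked datasets the library already does that for you. Chunks you
touch sit in the raw-data chunk cache, and dirty ones get written back
when they are evicted or the dataset is closed, so there is nothing
for the application to track. You can size that cache to match your
access pattern on the file access property list, e.g. (numbers made
up):

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
/* args: mdc_nelmts (unused in 1.8), cache slots, bytes, preemption */
H5Pset_cache(fapl, 0, 2003, 64*1024*1024, 0.75);
hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, fapl);
H5Pclose(fapl);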
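Also, since the VFD code comes up again below: any driver, including
the attached mmap hack, gets attached to a file through its set-fapl
call on a file access property list. To get a feel for the mechanics
with a driver that ships with the library, the 'core' VFD pulls the
whole file into memory and, with the backing store turned on, flushes
changes back to disk on close. That only works while the file still
fits in RAM, so it is not a real answer to your problem, but the
property-list plumbing is the same:

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_core(fapl, 1<<20 /* growth increment */, 1 /* backing store */);
hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, fapl);
H5Pclose(fapl);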
Just my two cents. Good luck.

> -Philip
>
> On Mon, Dec 6, 2010 at 3:21 PM, Mark Miller <[email protected]> wrote:
> > I am not sure if you got an answer to this email and so I thought I
> > would pipe up.
> >
> > Yes, you can do mmap if you'd like. I took HDF5's sec2 Virtual File
> > Driver (VFD) and tweaked it to use mmap instead, just to test how
> > something like this would work. I've attached the (hacked) code. To
> > use it, you are going to have to learn a bit about HDF5 VFDs. Learn
> > about them in File Access Property lists,
> > http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html, as well as
> > http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html
> >
> > It is something to start with. I don't know if HDF5 has plans for
> > writing an mmap-based VFD, but they really ought to; it is
> > something that is definitely lacking from their supported VFDs
> > currently.
> >
> > Mark
> >
> > On Fri, 2010-12-03 at 17:02, Philip Winston wrote:
> > > We just added HDF5 support in our application. We are using the C
> > > API. Our datasets are 1D and 2D arrays of integers, a pretty
> > > simple structure on disk. Today we have about 5GB of data and we
> > > load the whole thing into RAM, do somewhat random reads, make
> > > changes, then overwrite the old .h5 file.
> > >
> > > I only learned the bare minimum of the HDF5 API to accomplish the
> > > above, and it was pretty easy. Now we are looking at supporting
> > > much larger datasets, such that it will no longer be practical to
> > > have the whole thing in memory. This is where I'm confused on
> > > exactly what HDF5 offers vs. what is up to the application, and
> > > on what's the best way to do things in the application.
> > >
> > > Ideally, in my mind, what I want is an mmap-like interface: just
> > > a raw pointer which "magically" pages stuff off disk in response
> > > to reads, and writes stuff back to disk in response to writes.
> > > Does HDF5 have something like this, or can/do people end up
> > > writing something like this on top of HDF5? Today our datasets
> > > are contiguous, and I assume we'd want chunked datasets instead,
> > > but it's not clear to me how much "paging" functionality chunking
> > > buys you and how much you have to implement.
> > >
> > > Thanks for any ideas or pointers.
> > >
> > > -Philip

-- 
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
[email protected]  urgent: [email protected]
T:8-6 (925)-423-5901  M/W/Th:7-12,2-7 (530)-753-8511

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
