Hi Philip,
Have you considered using the 'core' file driver (H5Pset_fapl_core)?
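A minimal sketch of what that looks like; "data.h5" and the 1 MiB
allocation increment are placeholders:

    #include "hdf5.h"

    int main(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* Grow the in-memory image in 1 MiB steps; the nonzero third
           argument writes the image back to the file on H5Fclose(). */
        H5Pset_fapl_core(fapl, 1024 * 1024, 1);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, fapl);

        /* ... H5Dread/H5Dwrite as usual; all I/O now hits memory ... */

        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }

With the backing store enabled, the whole file is read into memory on
open and flushed back on close, so everything in between is a pure
memory operation.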
Quincey
On Dec 6, 2010, at 6:52 PM, Philip Winston wrote:
> I mean the code I gave you mmaps the file as a whole, not individual
> datasets in the file. But it nonetheless mmaps UNDERNEATH the explicit
> reads/writes (e.g. H5Dread/H5Dwrite calls) made by the application. So
> I am thinking this is nowhere near the paradigm you were hoping for.
>
> I was hoping for a true mmap model, but now I see that is perhaps
> impossible. mmap only works if what is in memory is identical to what's
> on disk, and for HDF5, endianness alone can break that assumption,
> right? Plus lots of other things, like chunked datasets.
>
> So for my situation, one option is to keep HDF5 around for
> interchange, but at runtime "optimize" to a simple binary format where
> I can mmap the entire dataset. Then I can just read/write anywhere and
> the OS takes care of everything.
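>
> Roughly what I have in mind; a sketch, assuming a flat file of
> native-endian int32 values (the names here are made up):
>
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <sys/mman.h>
>   #include <sys/stat.h>
>   #include <unistd.h>
>
>   /* Map the whole file and hand back a raw pointer; the OS pages
>      data in on reads and, with MAP_SHARED, writes stores back. */
>   int32_t *map_table(const char *path, size_t *n_out)
>   {
>       int fd = open(path, O_RDWR);
>       if (fd < 0)
>           return NULL;
>
>       struct stat st;
>       fstat(fd, &st);
>
>       void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
>                      MAP_SHARED, fd, 0);
>       close(fd);  /* the mapping stays valid after close */
>
>       if (p == MAP_FAILED)
>           return NULL;
>       *n_out = (size_t)st.st_size / sizeof(int32_t);
>       return (int32_t *)p;
>   }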
>
> It's tempting: coming from a situation where everything is in RAM
> today, it seems like the least work to keep accessing data randomly and
> let the OS figure it out. But I don't know how smart that is. Maybe
> it's kind of a red herring: it would work, but perform horribly. Maybe,
> coming from a situation where everything is in RAM, we have to rethink
> things a lot to make them work off disk, organizing the data for
> coherent access so we can read big chunks instead of single rows.
>
> My experience is that for simple queries (give me this hyperslab of
> data), products like HDF5 are going to give better I/O performance than
> some RDBMS. But if you are really talking about highly sophisticated
> queries, where future reads/writes depend upon other parts of the query
> and the datasets being queried, that sounds more like an RDBMS than an
> I/O-library sort of thing. Just my two cents. Good luck.
>
> Our data is essentially a tabular representation of a tree. Every row
> is a node in the tree. There are 2-10 values in a row, but tens of
> millions of rows. So in a sense our queries do depend on values as we
> read them, because, for example, we'll read a value, find the children
> of a node, read those values, and so on.
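>
> To make that concrete, one row/node might look something like this
> (the exact layout is invented for illustration):
>
>   #include <stdint.h>
>
>   /* One node of the tree, stored as one row of the table. */
>   typedef struct {
>       int64_t first_child;   /* row index of first child, -1 if leaf  */
>       int64_t next_sibling;  /* row index of next sibling, -1 if none */
>       int32_t values[8];     /* the 2-10 payload values per node      */
>   } node_row;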
>
> I imagine HDF5 is best when reading large amounts of data each time.
> We would generally always be reading one row at a time: set up one
> hyperslab, tiny read, new hyperslab, tiny read.
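>
> Each per-node lookup would be something like this (a sketch; the
> dataset handle and row width are assumed):
>
>   #include "hdf5.h"
>
>   /* Read one row of a 2D int dataset: select a 1 x ncols hyperslab
>      in the file, then issue a tiny read for just those values. */
>   static void read_row(hid_t dset, hsize_t row, hsize_t ncols, int *buf)
>   {
>       hsize_t start[2] = { row, 0 };
>       hsize_t count[2] = { 1, ncols };
>
>       hid_t fspace = H5Dget_space(dset);
>       H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count,
>                           NULL);
>       hid_t mspace = H5Screate_simple(2, count, NULL);
>
>       H5Dread(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
>
>       H5Sclose(mspace);
>       H5Sclose(fspace);
>   }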
>
> We have other uses in mind for HDF5, but for this particular type of
> data I wonder if it's just not a good fit.
>
> -Philip
>
>
>
> > On Mon, Dec 6, 2010 at 3:21 PM, Mark Miller <[email protected]> wrote:
> > I am not sure if you got an answer to this email and so I thought I
> > would pipe up.
> >
> > Yes, you can do mmap if you'd like. I took HDF5's sec2 Virtual File
> > Driver (VFD) and tweaked it to use mmap instead, just to test how
> > something like this would work. I've attached the (hacked) code. To
> > use it, you are going to have to learn a bit about HDF5 VFDs. Learn
> > about them in File Access Property Lists,
> > http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html, as well as
> > http://www.hdfgroup.org/HDF5/doc/TechNotes/VFL.html
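> >
> > The gist is that a driver is selected through a File Access Property
> > List. For example, the stock sec2 driver goes like this; the mmap
> > version in the attached code slots in the same way through its own
> > setter (the file name here is a placeholder):
> >
> >   #include "hdf5.h"
> >
> >   hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
> >   H5Pset_fapl_sec2(fapl);   /* pick the VFD on the property list */
> >   hid_t file = H5Fopen("data.h5", H5F_ACC_RDWR, fapl);
> >   /* ... normal HDF5 calls; the VFD handles the low-level I/O ... */
> >   H5Fclose(file);
> >   H5Pclose(fapl);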
> >
> >
> > It is something to start with. I don't know if HDF5 has plans for
> > writing an mmap-based VFD, but they really ought to; it is something
> > that is definitely lacking from their supported VFDs currently.
> >
> > Mark
> >
> > On Fri, 2010-12-03 at 17:02, Philip Winston wrote:
> > > We just added HDF5 support in our application. We are using the C
> > > API. Our datasets are 1D and 2D arrays of integers, a pretty simple
> > > structure on disk. Today we have about 5GB of data and we load the
> > > whole thing into RAM, do somewhat random reads, make changes, then
> > > overwrite the old .h5 file.
> > >
> > > I only learned a very minimal amount of the HDF5 API to accomplish
> > > the above, and it was pretty easy. Now we are looking at supporting
> > > much larger datasets, such that it will no longer be practical to
> > > have the whole thing in memory. This is where I'm confused on
> > > exactly what HDF5 offers vs. what is up to the application, and on
> > > what's the best way to do things in the application.
> > >
> > > Ideally, in my mind, what I want is an mmap-like interface: just a
> > > raw pointer which "magically" pages stuff off disk in response to
> > > reads, and writes stuff back to disk in response to writes. Does
> > > HDF5 have something like this, or can/do people end up writing
> > > something like this on top of HDF5? Today our datasets are
> > > contiguous, and I'm assuming we'd want chunked datasets instead,
> > > but it's not clear to me how much "paging" functionality chunking
> > > buys you and how much you have to implement.
> > >
> > > Thanks for any ideas or pointers.
> > >
> > > -Philip
> >
> > --
> > Mark C. Miller, Lawrence Livermore National Laboratory
> > ================!!LLNL BUSINESS ONLY!!================
> > [email protected] urgent: [email protected]
> > T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org