On Tue, Apr 12, 2011 at 11:57 AM, Mark Miller <[email protected]> wrote:

> On Tue, 2011-04-12 at 10:39 -0700, Leigh Orf wrote:
> >
> > Understand that I just discovered the ability to do buffered I/O with
> > hdf5. I wasn't aware of the core serial driver until Friday!
>
> Yeah, there are a lot of interesting dark corners of the HDF5 library
> that are useful to know about. The core driver is definitely one of
> them. That has saved my behind a few times when we've been in a bind
> on performance.

Indeed. Write operations go pretty fast when there is no actual disk access!
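For reference, turning on the core driver amounts to just a file access
property list. A minimal C sketch of the idea follows (my actual code is
fortran, so this is only the shape of it; the 1 MiB increment and the file
name are arbitrary placeholders):

#include <hdf5.h>

int main(void)
{
    /* File access property list selecting the core (in-memory) driver:
     * writes accumulate in a RAM buffer that grows in 1 MiB increments,
     * and backing_store = 1 flushes the buffer to a real file on close. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, (size_t)(1 << 20), 1);

    hid_t file = H5Fcreate("buffered.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create groups/datasets and write as usual; nothing touches the
     * filesystem until the close below ... */

    H5Fclose(file);   /* the single large write to disk happens here */
    H5Pclose(fapl);
    return 0;
}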
> > I am going to look carefully at your code. At first glance, it
> > appears to be a similar approach to what I have tried, but in my case
> > I created new MPI communicators which spanned any number of cores
> > (though it has to divide evenly into the full problem, unlike with
> > your approach). In my case, each subcommunicator would use pHDF5
> > collective calls to concurrently write to its own file, and I could
> > choose the number of files. I still had lousy performance with all my
> > choices of number of files.
> >
> > It is not entirely clear to me that you are doing true collective
> > parallel HDF5 (where I have had problems but have been led to believe
>
> That's right. There is NOTHING I/O-wise that is parallel. That code is
> designed to work with SERIAL compiled HDF5. The only parallel parts are
> the file management to orchestrate parallel I/O to multiple files
> concurrently. It is the 'Poor Man's' approach to parallel I/O. It is
> described in the pmpio.h header file a bit and more here...
>
> http://visitbugs.ornl.gov/projects/hpc-hdf5/wiki/Poor_Man's_vs_Rich_Mans'_Parallel_IO

Yes, I am familiar with the concept and have found that link before. It
seems we're at a point with HPC where there are still unanswered questions
about the fastest and most practical way to get data from CPU space to disk
space in a way that also makes post-processing sufficiently simple.

I see more clearly now what you're doing. You store each core's chunk of
the data as an hdf5 group in a file, and only one core accesses a file at a
time. So each core gets its baton, opens, writes, closes, and hands off. I
am pretty sure there is very little overhead for opens and closes when a
single core is doing it (as opposed to thousands of concurrent opens, for
instance), but perhaps it will be non-negligible when multiplied by 100,000?
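Just to check that I have the pattern right, here is how I picture the baton
passing with plain serial HDF5, written as a toy C sketch rather than
anything from your pmpio.h; the 64-ranks-per-file split, the integer baton,
and the file/group names are all made up for illustration:

#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>

#define RANKS_PER_FILE 64   /* illustrative: 64 writers share one file */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int filenum = rank / RANKS_PER_FILE;   /* which file I write to   */
    int slot    = rank % RANKS_PER_FILE;   /* my position in the line */

    char fname[64];
    snprintf(fname, sizeof fname, "chunk_%04d.h5", filenum);

    /* Wait for the baton from the previous rank writing this file. */
    int baton = 0;
    if (slot > 0)
        MPI_Recv(&baton, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    /* Serial HDF5: the first writer creates the file, the rest append. */
    hid_t file = (slot == 0)
        ? H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)
        : H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);

    /* One group per core holds this core's chunk of the domain. */
    char gname[64];
    snprintf(gname, sizeof gname, "rank_%06d", rank);
    hid_t grp = H5Gcreate2(file, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    /* ... H5Dcreate/H5Dwrite this core's arrays under grp ... */
    H5Gclose(grp);
    H5Fclose(file);

    /* Hand the baton to the next rank sharing this file. */
    if (slot < RANKS_PER_FILE - 1 && rank + 1 < nranks)
        MPI_Send(&baton, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}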
Your files contain multiple groups representing spatial things; my latest
attempt is multiple groups representing spatial information at a given
location. I see no reason why you couldn't have both multiple times (one
group hierarchy) and core contributions (another one) in a single file, all
designated by hdf5 groups.

Your approach is one of the few remaining I have yet to try. I'm not sure
it will give better I/O performance than one file per core, and that is my
main concern right now. I am kind of used to using actual unix directories
in the way that you are using hdf5 groups, to some degree, to reduce the
number of files in a given directory.

I notice you are a VisIt developer as well (I think we've corresponded
before). Is there a VisIt plugin using hdf5 for your PMPIO format? This
could be an additional motivator for me to try your approach, as I will be
using VisIt for visualization on huge runs soon. I've twice developed VisIt
plugins (hdf4, then hdf5), but only for file-per-core, and my C++ skills
are somewhat lacking.

Leigh

> > it is a path to happiness) as you do not call h5pset_dxpl_mpio and
> > set the H5FD_MPIO_COLLECTIVE flag. You also do not construct a
> > property list and pass it to h5dwrite, instructing each I/O core to
> > write its own piece of an hdf5 file using offset arrays,
> > h5sselect_hyperslab calls, etc., which is what the examples I have
> > found led me to. It seems you are effectively doing serial hdf5 in
> > parallel, which is what I am leaning towards at this point. Your
> > approach is more elegant than mine, but I am (a) stuck with fortran
> > and (b) not a programmer by training, although C is my preferred
> > language for I/O. Not sure if I could call your code from fortran
> > easily without going through contortions (again forgive me, I am a
> > weather guy who pretends he is a programmer).
> >
> > I fully embraced parallel hdf5 because I thought it could give me all
> > the flexibility I needed to essentially tune
>
> So, I find the all-collective-all-the-time API for parallel HDF5 to be
> way too 'inflexible' to handle sophisticated I/O patterns where data
> type, size, shape, and even existence vary substantially from processor
> to processor. For bread-and-butter data parallel apps where essentially
> the same few data structures (distributed arrays) are distributed
> across processors, it works ok. But none of the simulation apps I
> support have that kind of a (simple) I/O pattern, nor even approximate
> it, especially for plot outputs.

I think pHDF5 is very neat and useful in some situations. My own experience
is that with a modest number of cores (around 1k) performance is adequate,
but for whatever reason bumping it up another order of magnitude leads to
badness.
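For comparison, the collective pattern the examples led me to looks more or
less like the sketch below: one shared file opened with the MPI-IO driver, a
transfer property list with H5FD_MPIO_COLLECTIVE, and each rank selecting
its own hyperslab. The 1-D array, its size, and the dataset name are
placeholders, not my actual model fields.

#include <mpi.h>
#include <hdf5.h>

#define NX 1024   /* values owned by each rank; illustrative only */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Open one shared file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    /* Global dataspace and this rank's hyperslab within it. */
    hsize_t gdims[1]  = { (hsize_t)nranks * NX };
    hsize_t count[1]  = { NX };
    hsize_t offset[1] = { (hsize_t)rank * NX };

    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* Transfer property list requesting collective I/O. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    float local[NX];
    for (int i = 0; i < NX; i++) local[i] = (float)rank;
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file);
    MPI_Finalize();
    return 0;
}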
> --
> Mark C. Miller, Lawrence Livermore National Laboratory
> ================!!LLNL BUSINESS ONLY!!================
> [email protected]  urgent: [email protected]
> T:8-6 (925)-423-5901  M/W/Th:7-12,2-7 (530)-753-8511

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for
Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
