Just another voice to join the choir...

Our BGQ running GPFS was switched on last week, and my HDF5 performance has been 
around 10% of IOR - compared with the Cray running Lustre, where we get roughly 
60% of IOR.

This morning I did some poking around and found that the CMake-based configure 
of HDF5 has a nasty bug that causes H5_HAVE_GPFS to be set to false, so no GPFS 
optimizations are compiled in (libgpfs is not detected). Having tweaked that and 
recompiled everything, you can imagine my happiness when I found I'm now getting 
even worse bandwidth.
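
For anyone who wants to sanity-check their own build, the quickest test I know of 
is to see whether the feature macro made it into the generated configuration 
header (H5pubconf.h, which hdf5.h pulls in). Something along these lines - just a 
sketch, compiled with the same wrappers you build the app with:

  /* check_gpfs.c - prints whether this HDF5 build has GPFS support compiled in.
   * Build with e.g.: mpicc check_gpfs.c -o check_gpfs -lhdf5 */
  #include <stdio.h>
  #include "hdf5.h"   /* brings in the generated H5pubconf.h */

  int main(void)
  {
  #ifdef H5_HAVE_GPFS
      printf("H5_HAVE_GPFS is defined: GPFS hints were compiled in\n");
  #else
      printf("H5_HAVE_GPFS is NOT defined: GPFS optimizations are disabled\n");
  #endif
      return 0;
  }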

In fact, if I enable collective I/O the app core dumps on me, so the situation 
is worse than I had feared. I suspect I'm using too much memory in my test and 
the collectives are pushing me over the limit. The only test I can run with 
collective enabled is the one that uses only one rank and writes 16 MB!
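
For reference, the write path in my test is nothing exotic - roughly the sketch 
below (the file name, dataset name and 16 MB-per-rank size are illustrative, not 
the real benchmark code). The only thing that changes between my collective and 
independent runs is the transfer mode on the dataset transfer property list:

  /* sketch of a shared-file write: each rank writes one 16 MB slab of a
   * single 1D dataset through the MPI-IO driver */
  #include <stdlib.h>
  #include <mpi.h>
  #include "hdf5.h"

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      hsize_t count = 1 << 21;                      /* 2M doubles = 16 MB per rank */
      double *buf = malloc(count * sizeof(double));
      for (hsize_t i = 0; i < count; i++) buf[i] = (double)rank;

      /* open one shared file with the MPI-IO file driver */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
      hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      /* one contiguous dataset; rank r owns elements [r*count, (r+1)*count) */
      hsize_t dims = count * (hsize_t)nranks;
      hid_t filespace = H5Screate_simple(1, &dims, NULL);
      hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      hsize_t start = count * (hsize_t)rank;
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
      hid_t memspace = H5Screate_simple(1, &count, NULL);

      /* the switch in question: COLLECTIVE here, H5FD_MPIO_INDEPENDENT otherwise */
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

      H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace); H5Dclose(dset);
      H5Pclose(fapl); H5Fclose(file);
      free(buf);
      MPI_Finalize();
      return 0;
  }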

Looks like I'm going to have to spend quite a bit more time looking at this.

If anyone else is making tweaks to the HDF5 source, please let me know, as I 
don't want to duplicate what anyone else is doing; I'll be happy to help 
track down issues.

Rob: you mentioned some fcntl functions were deprecated, etc. Do I need to 
remove these to stop the core dumps? (I'm very much hoping something has gone 
wrong with my tests, because the performance is shockingly bad ...) (NB: my 
version is 1.8.12-snap17.)
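
(For clarity, I'm assuming you mean the gpfs_fcntl() access-range hints, i.e. 
calls of this general shape - sketched from the GPFS documentation, not lifted 
from the HDF5 source, and linked against -lgpfs:)

  /* give GPFS an access-range hint on an already-open file descriptor fd;
   * field names follow the gpfs_fcntl() examples in the GPFS manuals */
  #include <gpfs_fcntl.h>

  static int hint_access_range(int fd, long long start, long long length, int is_write)
  {
      struct {
          gpfsFcntlHeader_t hdr;
          gpfsAccessRange_t acc;
      } arg;

      arg.hdr.totalLength   = sizeof(arg);
      arg.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
      arg.hdr.fcntlReserved = 0;
      arg.acc.structLen     = sizeof(arg.acc);
      arg.acc.structType    = GPFS_ACCESS_RANGE;
      arg.acc.start         = start;
      arg.acc.length        = length;
      arg.acc.isWrite       = is_write;

      return gpfs_fcntl(fd, &arg);   /* 0 on success, -1 and errno on failure */
  }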

JB

> -----Original Message-----
> From: Hdf-forum [mailto:[email protected]] On Behalf
> Of Daniel Langr
> Sent: 20 September 2013 13:46
> To: HDF Users Discussion List
> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single
> (shared) file
> 
> Rob,
> 
> Thanks a lot for the hints. I will look at the suggested option and try some
> experiments with it :).
> 
> Daniel
> 
> 
> 
> On 17. 9. 2013 15:34, Rob Latham wrote:
> > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> >> separate files: 1.36 [s]
> >> single file, 1 stripe: 133.6 [s]
> >> single file, best result: 17.2 [s]
> >>
> >> (I did multiple runs with various combinations of stripe count and
> >> size, presenting the best results I have obtained.)
> >>
> >> Increasing the number of stripes obviously helped a lot, but
> >> compared with the separate-files strategy, the writing time is still
> >> more than ten times slower. Do you think it is "normal"?
> >
> > It might be "normal" for Lustre, but it's not good.  I wish I had more
> > experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
> > The ADIOS folks report tuned-HDF5 to a single shared file runs about
> > 60% slower than ADIOS to multiple files, not 10x slower, so it seems
> > there is room for improvement.
> >
> > I've asked them about the kinds of things "tuned HDF5" entails, and
> > they didn't know (!).
> >
> > There are quite a few settings documented in the intro_mpi(3) man
> > page.  MPICH_MPIIO_CB_ALIGN will probably be the most important thing
> > you can try.  I'm sorry to report that in my limited experience, the
> > documentation and reality are sometimes out of sync, especially with
> > respect to which settings are default or not.
> >
> > ==rob
> >
> >> Thanks,
> >> Daniel
> >>
> >> On 30. 8. 2013 16:05, Daniel Langr wrote:
> >>> I've run a benchmark where, within an MPI program, each process wrote
> >>> 3 plain 1D arrays to 3 datasets of an HDF5 file. I've used the
> >>> following writing strategies:
> >>>
> >>> 1) each process writes to its own file,
> >>> 2) each process writes to the same file to its own dataset,
> >>> 3) each process writes to the same file to the same dataset.
> >>>
> >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size 1024),
> >>> and I've tested 2)-3) for both independent/collective options of the
> >>> MPI driver. I've also used 3 different clusters for measurements
> >>> (all quite modern).
> >>>
> >>> As a result, the running (storage) times of the same-file strategy, i.e.
> >>> 2) and 3), were orders of magnitude longer than the running
> >>> times of the separate-files strategy. For illustration:
> >>>
> >>> cluster #1, 512 MPI processes, each process stores 100 MB of data,
> >>> fixed data sets:
> >>>
> >>> 1) separate files: 2.73 [s]
> >>> 2) single file, independent calls, separate data sets: 88.54[s]
> >>>
> >>> cluster #2, 256 MPI processes, each process stores 100 MB of data,
> >>> chunked data sets (chunk size 1024):
> >>>
> >>> 1) separate files: 10.40 [s]
> >>> 2) single file, independent calls, shared data sets: 295 [s]
> >>> 3) single file, collective calls, shared data sets: 3275 [s]
> >>>
> >>> Any idea why the single-file strategy gives so poor writing performance?
> >>>
> >>> Daniel
> >>
> >
> 

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
