On Tue, Sep 17, 2013 at 02:41:13PM -0500, David Knaak wrote:
> Rob, Daniel, et al.,
> 
> I have started looking into this HDF5 performance issue on Cray systems.
> I am looking "under the covers" to determine where the bottlenecks are
> and what might be done about them.  Here are some preliminary comments.

Awesome!  With David Knaak on the case, this is going to get sorted out
tout de suite.

> * Shared file performance will almost always be slower than file per
> process performance because of the need for extent locking by the file
> system.

The extent locks are a problem, yes, but let's not discount the overhead
of creating N files.  Now, on GPFS the story is abysmal, and maybe Lustre
creates files in /close to/ no time flat, but creating a file does cost
something.

> * Shared file I/O using MPI I/O with collective buffering can usually
> achieve better than 50% of file per process if the file accesses are
> contiguous after aggregation.

Agreed!  Early (like 2008-era) Cray MPI-IO had a collective I/O algorithm
that was not well suited to Lustre.  That's no longer the case, and has
not been since MPT-3.2, but those initial poor experiences entered the
folklore and now "collective I/O is slow" is what everyone thinks, even
6 years and two machines later.
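If anyone wants to experiment, both collective buffering and Lustre
striping can be requested through MPI-IO hints when the shared file is
created.  A minimal sketch follows; the hint names are the usual ROMIO
ones, the stripe count and stripe size are just example values, and
which hints a given MPT release actually honors can vary:

#include <mpi.h>

/* Open a shared file for writing with collective buffering requested
 * and an explicit Lustre stripe count / stripe size.  Hint names are
 * the standard ROMIO ones; the values are only examples. */
MPI_File open_striped(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write",  "enable");   /* collective buffering on writes */
    MPI_Info_set(info, "striping_factor", "32");       /* stripe count (example value)   */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MiB stripe size (example)    */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}

Striping hints only take effect at file creation, so they have to be on
the open that actually creates the file (or the target directory can be
set up beforehand with lfs setstripe).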
> * What I am seeing so far analyzing simple IOR runs, comparing the MPIIO
> interface and the HDF5 interface, is that the MPI_File_set_size() call
> and the metadata writes done by HDF5 are both taking a lot of extra time.

Quincey will have to speak to this one, but I thought they greatly
reduced the number of MPI_File_set_size() calls in a recent release?

> * MPI_File_set_size eventually calls ftruncate(), which has been reported
> to take a long time on Lustre file systems.

The biggest problem with MPI_File_set_size and ftruncate is that
MPI_File_set_size is collective.  I don't know what changes Cray has
made to ROMIO, but for a long time ROMIO has had a "call ftruncate on
one processor" optimization.  David can confirm whether ADIOI_GEN_Resize
or its equivalent contains that optimization.
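Very roughly, that optimization looks like the sketch below.  This is a
standalone illustration of the idea, not ROMIO's actual code: only one
rank changes the file size and the rest just synchronize, so the file
system sees a single ftruncate() instead of N of them.

#include <mpi.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustration of the "resize on one rank" idea (not ROMIO's actual
 * implementation): rank 0 is the only process that changes the file
 * size, so Lustre sees one ftruncate() rather than one per process. */
int resize_on_one_rank(MPI_Comm comm, int fd, off_t new_size)
{
    int rank, rc = 0;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        rc = ftruncate(fd, new_size);       /* the only size-changing call */
    MPI_Bcast(&rc, 1, MPI_INT, 0, comm);    /* share the result; doubles as the sync point */
    return rc;
}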
> * The metadata is written by multiple processes in small records to the
> same regions of the file.  Some metadata always goes to the beginning of
> the file but some is written to other parts of the file.  Both cause a
> lot of lock contention, which slows performance.

I've bugged the HDF5 guys about this since 2008.  It's work in progress
under ExaHDF5 (I think), so there's hope that we will see a scalable
metadata approach soon.

> I still need to verify what I think I am seeing.  I don't know yet what
> can be done about either of these.  But 5x or 10x slowdown is not
> acceptable.

One thing that is missing is a good understanding of what the MPT tuning
knobs can and cannot do.

Another thing that is missing is a way to disentangle the I/O stack: if
we had some way to say "this app spent X% in HDF5-related things, Y% in
MPI-IO things, and Z% in Lustre things", that would go a long way towards
directing effort.  Have you seen the work we've done with Darshan lately?
Darshan had some bad experiences on Lustre a few years back, but Phil
Carns and Yushu Yao have really whipped it into shape for Hopper (see
Phil and Yushu's recent CUG paper).  It'd be nice to have Darshan on more
Cray systems.  It's been a huge asset on Argonne's Blue Gene machines.

==rob

> 
> On Tue, Sep 17, 2013 at 08:34:10AM -0500, Rob Latham wrote:
> > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > separate files: 1.36 [s]
> > > single file, 1 stripe: 133.6 [s]
> > > single file, best result: 17.2 [s]
> > > 
> > > (I did multiple runs with various combinations of stripe count and
> > > size, presenting the best results I have obtained.)
> > > 
> > > Increasing the number of stripes obviously helped a lot, but
> > > comparing with the separate-files strategy, the writing time is
> > > still more than ten times slower.  Do you think it is "normal"?
> > 
> > It might be "normal" for Lustre, but it's not good.  I wish I had
> > more experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
> > The ADIOS folks report tuned-HDF5 to a single shared file runs about
> > 60% slower than ADIOS to multiple files, not 10x slower, so it seems
> > there is room for improvement.
> > 
> > I've asked them about the kinds of things "tuned HDF5" entails, and
> > they didn't know (!).
> > 
> > There are quite a few settings documented in the intro_mpi(3) man
> > page.  MPICH_MPIIO_CB_ALIGN will probably be the most important thing
> > you can try.  I'm sorry to report that in my limited experience, the
> > documentation and reality are sometimes out of sync, especially with
> > respect to which settings are default or not.
> > 
> > ==rob
> > 
> > > Thanks,
> > > Daniel
> > > 
> > > On 30. 8. 2013 16:05, Daniel Langr wrote:
> > > >I've run some benchmarks, where within an MPI program, each process
> > > >wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.  I've used
> > > >the following writing strategies:
> > > >
> > > >1) each process writes to its own file,
> > > >2) each process writes to the same file to its own dataset,
> > > >3) each process writes to the same file to the same dataset.
> > > >
> > > >I've tested 1)-3) for both fixed/chunked datasets (chunk size 1024),
> > > >and I've tested 2)-3) for both independent/collective options of the
> > > >MPI driver.  I've also used 3 different clusters for measurements
> > > >(all quite modern).
> > > >
> > > >As a result, the running (storage) times of the same-file strategy,
> > > >i.e. 2) and 3), were orders of magnitude longer than the running
> > > >times of the separate-files strategy.  For illustration:
> > > >
> > > >cluster #1, 512 MPI processes, each process stores 100 MB of data,
> > > >fixed data sets:
> > > >
> > > >1) separate files: 2.73 [s]
> > > >2) single file, independent calls, separate data sets: 88.54 [s]
> > > >
> > > >cluster #2, 256 MPI processes, each process stores 100 MB of data,
> > > >chunked data sets (chunk size 1024):
> > > >
> > > >1) separate files: 10.40 [s]
> > > >2) single file, independent calls, shared data sets: 295 [s]
> > > >3) single file, collective calls, shared data sets: 3275 [s]
> > > >
> > > >Any idea why the single-file strategy gives so poor writing
> > > >performance?
> > > >
> > > >Daniel
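For what it's worth, Daniel's strategy 3 (single shared file, shared
dataset, collective calls) comes down to a handful of property-list
calls on the HDF5 side.  A minimal sketch, with a made-up dataset name
and layout, in case anyone wants to reproduce the comparison:

#include <mpi.h>
#include <hdf5.h>

/* Strategy 3 in miniature: one shared file, one shared dataset, each
 * rank writing its own contiguous slab with a collective transfer.
 * Dataset name and layout are made up for illustration. */
void write_shared(MPI_Comm comm, const char *path,
                  const double *buf, hsize_t count)
{
    int nprocs, myrank;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &myrank);

    /* Route HDF5 through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One dataset covering every rank's data. */
    hsize_t dims = count * (hsize_t)nprocs;
    hid_t fspace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, fspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own contiguous slab of the dataset. */
    hsize_t start = count * (hsize_t)myrank;
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mspace = H5Screate_simple(1, &count, NULL);

    /* Collective transfer so MPI-IO can aggregate the writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Pclose(fapl);
    H5Fclose(file);
}

An MPI_Info carrying striping hints can be passed to H5Pset_fapl_mpio
in place of MPI_INFO_NULL.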
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org