Rob, Daniel, et al.,

I have started looking into this HDF5 performance issue on Cray systems.
I am looking "under the covers" to determine where the bottlenecks are
and what might be done about them.  Here are some preliminary comments.

* Shared-file performance will almost always be slower than file-per-process
performance because of the need for extent locking by the file system.

* Shared-file I/O using MPI-IO with collective buffering can usually
achieve better than 50% of file-per-process performance if the file
accesses are contiguous after aggregation.
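
As a reference point, here is a minimal sketch of passing collective-buffering
hints through an MPI_Info object when opening a shared file.  The hint names
are standard ROMIO hints, but the values and the file name are only
illustrative, not tuned recommendations; on Cray systems the MPICH_MPIIO_*
environment variables from intro_mpi(3) affect the same machinery.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* Illustrative hint values only -- not tuned recommendations. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");    /* collective buffering on writes */
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MiB aggregation buffer */
        MPI_Info_set(info, "striping_factor", "16");       /* Lustre stripe count hint */

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        /* ... collective writes would go here ... */
        MPI_File_close(&fh);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

The same info object can be handed to HDF5 through H5Pset_fapl_mpio()
instead of MPI_INFO_NULL.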

* What I am seeing so far from analyzing simple IOR runs, comparing the
MPI-IO interface with the HDF5 interface, is that the MPI_File_set_size()
call and the metadata writes done by HDF5 both take a lot of extra time.

* MPI_File_set_size() eventually calls ftruncate(), which has been reported
to take a long time on Lustre file systems.
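
A quick way to confirm that on a given system is to time MPI_File_set_size()
in isolation.  A minimal sketch (the file name and size are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        double   t0, t1;
        int      rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "truncate_test.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        /* Collective call; underneath it ends up as an ftruncate() on the file. */
        MPI_File_set_size(fh, (MPI_Offset)1 << 30);   /* 1 GiB, arbitrary */
        MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_File_set_size: %.3f s\n", t1 - t0);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }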

* The metadata is written by multiple processes in small records to the
same regions of the file.  Some metadata always goes to the beginning of
the file, but some is written to other parts of the file.  Both patterns
cause a lot of lock contention, which slows performance.
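
If the metadata writes really are the problem, the HDF5 file-access
properties that control metadata placement may be worth experimenting with.
A minimal sketch, assuming the parallel (MPI-IO) file driver; the 1 MiB
values below are guesses, not tuned recommendations:

    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* Pool small metadata allocations into larger blocks and align
         * objects of 1 MiB or more on 1 MiB boundaries (e.g. a stripe
         * boundary) to cut down on small, overlapping writes. */
        H5Pset_meta_block_size(fapl, 1048576);
        H5Pset_alignment(fapl, 1048576, 1048576);

        hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        /* ... create datasets and write collectively ... */
        H5Fclose(file);
        H5Pclose(fapl);

        MPI_Finalize();
        return 0;
    }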

I still need to verify what I think I am seeing, and I don't know yet what
can be done about either of these issues.  But a 5x or 10x slowdown is not
acceptable.

David

On Tue, Sep 17, 2013 at 08:34:10AM -0500, Rob Latham wrote:
> On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > separate files: 1.36 [s]
> > single file, 1 stripe: 133.6 [s]
> > single file, best result: 17.2 [s]
> > 
> > (I did multiple runs with various combinations of stripe count and
> > size, presenting the best results I have obtained.)
> > 
> > Increasing the number of stripes obviously helped a lot, but
> > compared with the separate-files strategy, writing is
> > still more than ten times slower. Do you think this is "normal"?
> 
> It might be "normal" for Lustre, but it's not good.  I wish I had
> more experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
> The ADIOS folks report tuned-HDF5 to a single shared file runs about
> 60% slower than ADIOS to multiple files, not 10x slower, so it seems
> there is room for improvement.
> 
> I've asked them about the kinds of things "tuned HDF5" entails, and
> they didn't know (!). 
> 
> There are quite a few settings documented in the intro_mpi(3) man
> page.  MPICH_MPIIO_CB_ALIGN will probably be the most important thing
> you can try.  I'm sorry to report that in my limited experience, the
> documentation and reality are sometimes out of sync, especially with
> respect to which settings are default or not.
> 
> ==rob
> 
> > Thanks,
> > Daniel
> > 
> > On 30. 8. 2013 16:05, Daniel Langr wrote:
> > >I've run a benchmark where, within an MPI program, each process wrote
> > >3 plain 1D arrays to 3 datasets of an HDF5 file. I've used the following
> > >writing strategies:
> > >
> > >1) each process writes to its own file,
> > >2) each process writes to the same file to its own dataset,
> > >3) each process writes to the same file to a same dataset.
> > >
> > >I've tested 1)-3) for both fixed/chunked datasets (chunk size 1024), and
> > >I've tested 2)-3) for both independent/collective options of the MPI
> > >driver. I've also used 3 different clusters for measurements (all quite
> > >modern).
> > >
> > >As a result, the running (storage) times of the same-file strategies, i.e.
> > >2) and 3), were orders of magnitude longer than the running times of
> > >the separate-files strategy. For illustration:
> > >
> > >cluster #1, 512 MPI processes, each process stores 100 MB of data, fixed
> > >data sets:
> > >
> > >1) separate files: 2.73 [s]
> > >2) single file, independent calls, separate data sets: 88.54 [s]
> > >
> > >cluster #2, 256 MPI processes, each process stores 100 MB of data,
> > >chunked data sets (chunk size 1024):
> > >
> > >1) separate files: 10.40 [s]
> > >2) single file, independent calls, shared data sets: 295 [s]
> > >3) single file, collective calls, shared data sets: 3275 [s]
> > >
> > >Any idea why the single-file strategy gives such poor writing performance?
> > >
> > >Daniel
> > 
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
> 
