Hi Rob,

> On Tue, Sep 17, 2013 at 02:41:13PM -0500, David Knaak wrote:
> > * Shared file performance will almost always be slower than file per
> >   process performance because of the need for extent locking by the
> >   file system.

On Wed, Sep 18, 2013 at 09:04:49AM -0500, Rob Latham wrote:
> The extent locks are a problem, yes, but let's not discount the
> overhead of creating N files. Now, on GPFS the story is abysmal, and
> maybe Lustre creates files in /close to/ no time flat, but creating a
> file does cost something.
Lustre is pretty fast with file creates. We are measuring on the order
of 20,000 creates/second on some configurations. But that still means
50 seconds for a million files (I'm thinking exascale). Worse is just
the management of that many files. So I am definitely an advocate of
single shared files. But users, of course, want the best of both
worlds: a single file at file-per-process (FPP) speed. I believe they
will, in reality, accept 50%, but not 10% or 1%.

> > * What I am seeing so far analyzing simple IOR runs, comparing the
> >   MPI-IO interface and the HDF5 interface, is that the
> >   MPI_File_set_size() call and the metadata writes done by HDF5 are
> >   both taking a lot of extra time.

> Quincey will have to speak to this one, but I thought they greatly
> reduced the number of MPI_File_set_size() calls in a recent release?

Yes, they did. At least in my IOR test, there was just one call made by
each rank. But that still took a lot of time. See the next comment as
well.

> > * MPI_File_set_size eventually calls ftruncate(), which has been
> >   reported to take a long time on Lustre file systems.

> The biggest problem with MPI_File_set_size and ftruncate is that
> MPI_File_set_size is collective. I don't know what changes Cray's
> made to ROMIO, but for a long time ROMIO's had a "call ftruncate on
> one processor" optimization. David can confirm if ADIOI_GEN_Resize or
> its equivalent contains that optimization.

Yes, the Cray implementation does have that optimization (a rough
sketch of the pattern is included below). Given that, it is still very
surprising that a single call by one rank takes so much time.

> > * The metadata is written by multiple processes in small records to
> >   the same regions of the file. Some metadata always goes to the
> >   beginning of the file but some is written to other parts of the
> >   file. Both cause a lot of lock contention, which slows performance.

> I've bugged the HDF5 guys about this since 2008. It's work in
> progress under ExaHDF5 (I think), so there's hope that we will see a
> scalable metadata approach soon.

I have begun a conversation with the HDF Group about this. Perhaps some
help from me on the MPI-IO side will make it easier for them to do it
sooner.

> > I still need to verify what I think I am seeing. I don't know yet
> > what can be done about either of these. But a 5x or 10x slowdown is
> > not acceptable.

> One thing that is missing is a good understanding of what the MPT
> tuning knobs can and cannot do. Another thing that is missing is a
> way to detangle the I/O stack: if we had some way to say "this app
> spent X% in hdf5-related things, Y% in MPI-IO things and Z% in
> lustre-things", that would go a long way towards directing effort.

I used a combination of some internal tools to get a very clear picture
of where the time is spent. From this analysis, I have concluded that
there is nothing MPI-IO can do to improve this. It will require, I
believe, a change to HDF5 for the metadata issue, so that the metadata
can be aggregated, and a change to Lustre so that the ftruncate isn't
so slow. I will also be working the Lustre issue with the Lustre
developers.

> Have you seen the work we've done with Darshan lately? Darshan had
> some bad experiences on Lustre a few years back, but Phil Carns and
> Yushu Yao have really whipped it into shape for Hopper (see Phil and
> Yushu's recent CUG paper). It'd be nice to have Darshan on more Cray
> systems. It's been a huge asset on Argonne's Blue Gene machines.

I became aware of Darshan a while ago, but until this week I had not
used it. I have now built it and will begin using it to see what else I
can learn about the HDF5 performance.
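Since the rank-0 resize optimization came up above, here is roughly
what that pattern looks like, for anyone following along. This is only
a simplified sketch (the function name and error handling are mine,
not the actual ADIOI_GEN_Resize or Cray code):

    #include <sys/types.h>   /* off_t */
    #include <unistd.h>      /* ftruncate */
    #include <mpi.h>

    /* Resize a shared file by letting a single rank do the truncate,
     * then synchronizing, so the call still behaves collectively. */
    int resize_on_one_rank(MPI_Comm comm, int fd, off_t new_size)
    {
        int rank, err = 0;

        MPI_Comm_rank(comm, &rank);

        if (rank == 0) {
            /* Only one ftruncate() hits the file system. */
            if (ftruncate(fd, new_size) != 0)
                err = 1;
        }

        /* Every rank learns the outcome, and none proceeds until the
         * truncate has completed. */
        MPI_Bcast(&err, 1, MPI_INT, 0, comm);

        return err;
    }

Even with only one rank doing the truncate, that single ftruncate() is
what dominates in my traces, which is why I think part of the fix has
to come from Lustre itself.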
Thanks for your comments.

David

> > On Tue, Sep 17, 2013 at 08:34:10AM -0500, Rob Latham wrote:
> > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > > separate files: 1.36 [s]
> > > > single file, 1 stripe: 133.6 [s]
> > > > single file, best result: 17.2 [s]
> > > >
> > > > (I did multiple runs with various combinations of stripe count
> > > > and size, presenting the best results I have obtained.)
> > > >
> > > > Increasing the number of stripes obviously helped a lot, but
> > > > compared with the separate-files strategy, the writing time is
> > > > still more than ten times slower. Do you think it is "normal"?
> > >
> > > It might be "normal" for Lustre, but it's not good. I wish I had
> > > more experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
> > > The ADIOS folks report tuned-HDF5 to a single shared file runs
> > > about 60% slower than ADIOS to multiple files, not 10x slower, so
> > > it seems there is room for improvement.
> > >
> > > I've asked them about the kinds of things "tuned HDF5" entails,
> > > and they didn't know (!).
> > >
> > > There are quite a few settings documented in the intro_mpi(3) man
> > > page. MPICH_MPIIO_CB_ALIGN will probably be the most important
> > > thing you can try. I'm sorry to report that in my limited
> > > experience, the documentation and reality are sometimes out of
> > > sync, especially with respect to which settings are default or
> > > not.
> > >
> > > ==rob
> > >
> > > > Thanks,
> > > > Daniel
> > > >
> > > > On 30 Aug 2013 at 16:05, Daniel Langr wrote:
> > > > > I've run some benchmarks where, within an MPI program, each
> > > > > process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.
> > > > > I've used the following writing strategies:
> > > > >
> > > > > 1) each process writes to its own file,
> > > > > 2) each process writes to the same file, to its own dataset,
> > > > > 3) each process writes to the same file, to the same dataset.
> > > > >
> > > > > I've tested 1)-3) for both fixed/chunked datasets (chunk size
> > > > > 1024), and I've tested 2)-3) for both independent/collective
> > > > > options of the MPI driver. I've also used 3 different clusters
> > > > > for measurements (all quite modern).
> > > > >
> > > > > As a result, the running (storage) times of the same-file
> > > > > strategy, i.e. 2) and 3), were orders of magnitude longer than
> > > > > the running times of the separate-files strategy. For
> > > > > illustration:
> > > > >
> > > > > cluster #1, 512 MPI processes, each process stores 100 MB of
> > > > > data, fixed data sets:
> > > > >
> > > > > 1) separate files: 2.73 [s]
> > > > > 2) single file, independent calls, separate data sets: 88.54 [s]
> > > > >
> > > > > cluster #2, 256 MPI processes, each process stores 100 MB of
> > > > > data, chunked data sets (chunk size 1024):
> > > > >
> > > > > 1) separate files: 10.40 [s]
> > > > > 2) single file, independent calls, shared data sets: 295 [s]
> > > > > 3) single file, collective calls, shared data sets: 3275 [s]
> > > > >
> > > > > Any idea why the single-file strategy gives such poor writing
> > > > > performance?
> > > > >
> > > > > Daniel
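P.S. One concrete way to experiment with the knobs Rob mentions above
(striping and collective buffering) is to pass hints to HDF5's MPI-IO
driver through an MPI_Info object. The sketch below is illustrative
only: the hint values need tuning per system, and on Cray systems many
of the same knobs are set through environment variables documented in
intro_mpi(3), such as MPICH_MPIIO_CB_ALIGN.

    #include <mpi.h>
    #include <hdf5.h>

    /* Create a shared HDF5 file with Lustre striping and collective
     * buffering hints passed down to the MPI-IO (ROMIO) layer. */
    hid_t create_tuned_file(const char *name, MPI_Comm comm)
    {
        MPI_Info info;
        MPI_Info_create(&info);

        /* Striping is honored only when the file is created. The
         * values here are placeholders, not recommendations. */
        MPI_Info_set(info, "striping_factor", "16");      /* # of OSTs  */
        MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MiB      */
        MPI_Info_set(info, "romio_cb_write",  "enable");  /* coll. buf. */

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, info);

        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        H5Pclose(fapl);
        MPI_Info_free(&info);
        return file;
    }

    /* For each write, collective transfers are requested with:
     *
     *   hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
     *   H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
     *   H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
     */

Whether any of that closes the gap to file-per-process is exactly what
I am trying to measure; as noted above, the HDF5 metadata traffic and
the ftruncate cost look like they are outside MPI-IO's control.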
> > > > [email protected] > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > > > > > > -- > > > Rob Latham > > > Mathematics and Computer Science Division > > > Argonne National Lab, IL USA > > > > > > _______________________________________________ > > > Hdf-forum is for HDF software users discussion. > > > [email protected] > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > > > > -- > Rob Latham > Mathematics and Computer Science Division > Argonne National Lab, IL USA > > _______________________________________________ > Hdf-forum is for HDF software users discussion. > [email protected] > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org -- _______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
