I have been following this thread with interest since we have the same issue in
the synchrotron community, with new detectors generating hundreds to thousands
of 2D frames per second and total rates approaching 10 GB/s over multiple
parallel 10 GbE streams from different detector nodes. What we have found is:

 - Lustre is better at managing the pHDF5 contention between nodes than GPFS is.
 - GPFS is better at streaming data from one node, if there is no contention.
 - Having the nodes write to separate files is better than using pHDF5 to 
enable all nodes to write to one.

"Better" means a factor of 2-3 times, but we are still actively learning and we 
have more experience with Lustre than GPFS, so there may be some GPFS tweaks we 
are missing. The storage systems are comparable, both based on DDN SFA 
architecture and have ample throughput in simple "ior" tests. I think GPFS 
would also be comparable to Lustre at managing contention if all the data 
originated from one node, but we haven't been looking at this.

We are working with The HDF Group to define a work package dubbed "Virtual
Datasets", in which a virtual dataset in a master file is composed of datasets
in underlying files. It is a bit like extending the soft-link mechanism to
allow unions. The method of mapping the underlying datasets onto the virtual
dataset is very flexible, so we hope it can be used in a number of
circumstances. The two main requirements are:

 - The use of the virtual dataset is transparent to any program reading the 
data later.
 - The writing nodes can write their files independently, so they do not need
pHDF5.

An additional benefit is that the underlying files can be compressed, so data
rates may be reduced drastically, depending on your situation.
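
To give a flavour of the intended programming model, here is a minimal sketch
in C of how a master file might map per-node files into one virtual dataset.
This is illustrative only: the H5Pset_virtual-style call and the file and
dataset names are assumptions on my part, and the API in the draft RFC may
well end up looking different.

/* Sketch only: build a 4 x 100 virtual dataset in master.h5 whose rows
 * come from four hypothetical per-node files node0.h5 .. node3.h5, each
 * holding a 1 x 100 dataset called "/data". */
#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    hsize_t vdims[2] = {4, 100};        /* virtual (combined) extent  */
    hsize_t sdims[2] = {1, 100};        /* extent of each source file */
    hid_t vspace = H5Screate_simple(2, vdims, NULL);
    hid_t sspace = H5Screate_simple(2, sdims, NULL);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);

    for (int node = 0; node < 4; node++) {
        char fname[32];
        snprintf(fname, sizeof fname, "node%d.h5", node);
        hsize_t start[2] = {(hsize_t)node, 0};
        hsize_t count[2] = {1, 100};
        /* Select where this node's data lands in the virtual dataset,
         * then record the mapping in the creation property list.      */
        H5Sselect_hyperslab(vspace, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Pset_virtual(dcpl, vspace, fname, "/data", sspace);
    }

    hid_t file = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t dset = H5Dcreate2(file, "/virtual", H5T_NATIVE_FLOAT, vspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset); H5Fclose(file);
    H5Pclose(dcpl); H5Sclose(sspace); H5Sclose(vspace);
    return 0;
}

A reader of master.h5 would then simply open "/virtual" and read it like any
ordinary dataset, which is exactly the transparency requirement above.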

The status is that we have a draft RFC outlining the requirements, use cases
and programming model, and The HDF Group is preparing an estimate. The work is
not yet funded (I will be making a case to my directors for some of it), but if
it strikes a chord I would be only too willing to share the RFC, particularly
if there is any possibility of support becoming available.

Cheers,

Nick Rees
Principal Software Engineer           Phone: +44 (0)1235-778430
Diamond Light Source                  Fax:   +44 (0)1235-446713

-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of 
David Knaak
Sent: 19 September 2013 00:45
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single 
(shared) file

Hi Rob,

> On Tue, Sep 17, 2013 at 02:41:13PM -0500, David Knaak wrote:
> > * Shared file performance will almost always be slower than file per 
> > process performance because of the need for extent locking by the 
> > file system.
 
On Wed, Sep 18, 2013 at 09:04:49AM -0500, Rob Latham wrote:
> The extent locks are a problem, yes, but let's not discount the 
> overhead of creating N files.  Now, on GPFS the story is abysmal, and 
> maybe Lustre creates files in /close to/ no time flat, but creating a 
> file does cost something.

Lustre is pretty fast with file creates.  We are measuring on the order of 
20,000 creates/second on some configurations.  But that still means
50 seconds for a million files (I'm thinking exascale).  Worse is just the 
management of that many files.  So I am definitely an advocate of single shared
files.  But users, of course, want the best of both worlds: a single file at
file-per-process (FPP) speed.  I believe they will, in reality, accept 50%, but
not 10% or 1%.
 
> > * What I am seeing so far analyzing simple IOR runs, comparing the 
> > MPIIO interface and the HDF5 interface, is that the 
> > MPI_File_set_size() call and the metadata writes done by HDF5 are both 
> > taking a lot of extra time.
 
> Quincey will have to speak to this one, but I thought they greatly 
> reduced the number of MPI_File_set_size() calls in a recent release?

Yes, they did.  At least in my IOR test, there was just one call made by each 
rank.  But that still took a lot of time.  See next comment also.

> > * MPI_File_set_size eventually calls ftruncate(), which has been 
> > reported to take a long time on Lustre file systems.
 
> The biggest problem with MPI_File_set_size and ftruncate is that
> MPI_File_set_size is collective.   I don't know what changes Cray's
> made to ROMIO, but for a long time ROMIO has had a "call ftruncate on 
> one processor" optimization.  David can confirm whether ADIOI_GEN_Resize or 
> its equivalent contains that optimization.

Yes, the Cray implementation does have that optimization.  Given that, it is
very surprising that a single call by one rank still takes so much time.
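
For anyone who hasn't looked at ROMIO, the shape of that optimization is
roughly the following.  This is just a sketch with my own names, not the
actual ADIOI_GEN_Resize code, which also handles deferred opens and proper
error conversion:

/* Sketch: make a collective resize cheap by letting only one rank touch
 * the file system, then sharing the result with everyone else. */
#include <mpi.h>
#include <unistd.h>
#include <sys/types.h>

static int resize_collective(MPI_Comm comm, int fd, off_t size)
{
    int rank, err = 0;

    MPI_Comm_rank(comm, &rank);

    /* Only rank 0 issues the ftruncate, which is the expensive part on
     * Lustre; everyone else just waits for the outcome. */
    if (rank == 0)
        err = (ftruncate(fd, size) == 0) ? 0 : -1;

    /* Broadcasting the error code keeps the call collective. */
    MPI_Bcast(&err, 1, MPI_INT, 0, comm);
    return err;
}

So the time I am seeing is essentially the cost of that single ftruncate on
rank 0, plus the synchronization that every other rank sits in while it
completes.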
 
> > * The metadata is written by multiple processes in small records to 
> > the same regions of the file.  Some metadata always goes to the 
> > beginning of the file but some is written to other parts of the 
> > file.  Both cause a lot of lock contention, which slows performance.
 
> I've bugged the HDF5 guys about this since 2008.  It's work in 
> progress under ExaHDF5 (I think), so there's hope that we will see a 
> scalable metadata approach soon.

I have begun a conversation with the HDF Group about this.  Perhaps some help 
from me on the MPI-IO side will make it easier for them to do it sooner.
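
For completeness, the kind of change I mean is for HDF5 to let the application
request that metadata be written collectively, so the library can aggregate it
into a few large writes.  The following is a sketch only; the property names
H5Pset_coll_metadata_write and H5Pset_all_coll_metadata_ops are assumptions
here, not something you can rely on in a current release:

/* Sketch: open a file for parallel I/O while asking HDF5 to issue
 * metadata operations collectively so they can be aggregated. */
#include <mpi.h>
#include "hdf5.h"

static hid_t create_with_collective_metadata(const char *name, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    H5Pset_coll_metadata_write(fapl, 1);    /* aggregate metadata writes */
    H5Pset_all_coll_metadata_ops(fapl, 1);  /* collective metadata reads */

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}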

> > I still need to verify what I think I am seeing.  I don't know yet 
> > what can be done about either of these.  But 5x or 10x slowdown is 
> > not acceptable.
 
> One thing that is missing is a good understanding of what the MPT
> tuning knobs can and cannot do.   Another thing that is missing is a
> way to detangle the I/O stack: if we had some way to say "this app 
> spent X% in hdf5-related things, Y% in MPI-IO things and Z% in 
> lustre-things", that would go a long way towards directing effort.

I used a combination of some internal tools to get a very clear picture of
where the time is spent.  From this analysis, I have concluded that there is
nothing MPI-IO can do to improve this.  It will require, I believe, a change to
HDF5 for the metadata issue so that the metadata can be aggregated, and a
change to Lustre so that the ftruncate isn't so slow.  I will also be working
the Lustre issue with the Lustre developers.

> Have you seen the work we've done with Darshan lately?  Darshan had 
> some bad experiences on Lustre a few years back, but Phil Carns and 
> Yushu Yao have really whipped it into shape for Hopper (see Phil and 
> Yushu's recent CUG paper).  It'd be nice to have Darshan on more Cray 
> systems.  It's been a huge asset on Argonne's Blue Gene machines.

I became aware of Darshan a while ago, but I had not used it until this week.
I have now built it and will begin using it to see what else I can learn about
the HDF5 performance.

Thanks for your comments.
David

> > On Tue, Sep 17, 2013 at 08:34:10AM -0500, Rob Latham wrote:
> > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > > separate files: 1.36 [s]
> > > > single file, 1 stripe: 133.6 [s]
> > > > single file, best result: 17.2 [s]
> > > > 
> > > > (I did multiple runs with various combinations of stripe count 
> > > > and size, presenting the best results I have obtained.)
> > > > 
> > > > Increasing the number of stripes obviously helped a lot, but 
> > > > compared with the separate-files strategy, the writing time is 
> > > > still more than ten times slower.  Do you think this is "normal"?
> > > 
> > > It might be "normal" for Lustre, but it's not good.  I wish I had 
> > > more experience tuning the Cray/MPI-IO/Lustre stack, but I do not.
> > > The ADIOS folks report tuned-HDF5 to a single shared file runs 
> > > about 60% slower than ADIOS to multiple files, not 10x slower, so 
> > > it seems there is room for improvement.
> > > 
> > > I've asked them about the kinds of things "tuned HDF5" entails, 
> > > and they didn't know (!).
> > > 
> > > There are quite a few settings documented in the intro_mpi(3) man 
> > > page.  MPICH_MPIIO_CB_ALIGN will probably be the most important 
> > > thing you can try.  I'm sorry to report that in my limited 
> > > experience, the documentation and reality are sometimes out of 
> > > sync, especially with respect to which settings are default or not.
> > > 
> > > ==rob
> > > 
> > > > Thanks,
> > > > Daniel
> > > > 
> > > > On 30. 8. 2013 16:05, Daniel Langr wrote:
> > > > >I've run a benchmark where, within an MPI program, each process 
> > > > >wrote 3 plain 1D arrays to 3 datasets of an HDF5 file. I've used 
> > > > >the following writing strategies:
> > > > >
> > > > >1) each process writes to its own file,
> > > > >2) each process writes to the same file to its own dataset,
> > > > >3) each process writes to the same file to a same dataset.
> > > > >
> > > > >I've tested 1)-3) for both fixed/chunked datasets (chunk size 
> > > > >1024), and I've tested 2)-3) for both independent/collective 
> > > > >options of the MPI driver. I've also used 3 different clusters 
> > > > >for measurements (all quite modern).
> > > > >
> > > > >As a result, the running (storage) times of the same-file 
> > > > >strategy, i.e. 2) and 3), were orders of magnitude longer than 
> > > > >the running times of the separate-files strategy. For illustration:
> > > > >
> > > > >cluster #1, 512 MPI processes, each process stores 100 MB of 
> > > > >data, fixed data sets:
> > > > >
> > > > >1) separate files: 2.73 [s]
> > > > >2) single file, independent calls, separate data sets: 88.54 [s]
> > > > >
> > > > >cluster #2, 256 MPI processes, each process stores 100 MB of 
> > > > >data, chunked data sets (chunk size 1024):
> > > > >
> > > > >1) separate files: 10.40 [s]
> > > > >2) single file, independent calls, shared data sets: 295 [s]
> > > > >3) single file, collective calls, shared data sets: 3275 [s]
> > > > >
> > > > >Any idea why the single-file strategy gives such poor writing 
> > > > >performance?
> > > > >
> > > > >Daniel
> > > > 
> > > 
> > > --
> > > Rob Latham
> > > Mathematics and Computer Science Division Argonne National Lab, IL 
> > > USA
> > > 
> > 
> 
> --
> Rob Latham
> Mathematics and Computer Science Division Argonne National Lab, IL USA
> 

