On Fri, Sep 27, 2013 at 02:58:32PM +0000, Biddiscombe, John A. wrote:
> The difference that one flag can make is quite impressive. People need to
> know this!

Oh John, oh John... I cannot tell you how angry that flag makes me!

'bglockless:' was supposed to be a short-term hack. It was written for the
PVFS file system (which did not support fcntl()-style locks, or any locks at
all for that matter). Then we found out it helped GPFS on Blue Gene too. I'm
going to have to just sit down for a couple-five days, send IBM a patch
removing all the locks from the default driver, and tell anyone who wants to
run MPI-IO to an NFS file system on a Blue Gene to take a hike.
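For readers who haven't met the flag before: 'bglockless:' is not an HDF5 or
MPI option, just a ROMIO file-system prefix prepended to the file name
(exactly as in the testFile= lines of the IOR config quoted further down).
A minimal sketch of plain MPI-IO use, with a made-up path:

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Init(&argc, &argv);

      /* ROMIO chooses its file-system driver from the prefix; "bglockless:"
         skips the fcntl() locking done by the default GPFS/UFS driver.
         The path below is made up for illustration. */
      char path[] = "bglockless:/gpfs/scratch/example/output.dat";

      MPI_File_open(MPI_COMM_WORLD, path,
                    MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);
      /* ... MPI_File_write_at_all(), etc. ... */
      MPI_File_close(&fh);

      MPI_Finalize();
      return 0;
  }

The same prefixed name can be handed to HDF5 when the MPI-IO file driver is
in use; that is how the api=HDF5 run in the IOR config below picks it up.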
Thanks for the graphs. I was surprised to see that fewer than 8 cores per
node resulted in slightly *worse* performance for collective I/O.

==rob

> [inline image attachment]
>
> > > -----Original Message-----
> > > From: Hdf-forum [mailto:[email protected]] On Behalf
> > > Of Biddiscombe, John A.
> > > Sent: 20 September 2013 21:47
> > > To: HDF Users Discussion List
> > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single
> > > (shared) file
> > >
> > > Rob
> > >
> > > Thanks for the info regarding settings, IOR config, etc. I will go
> > > through that in detail over the next few days.
> > >
> > > I plan on taking a crash course in debugging on BG/Q ASAP; my skills in
> > > this regard are little better than printf, and I'm going to need to do
> > > some profiling and stepping through code to see what's going on inside
> > > HDF5.
> > >
> > > Just FYI, I run a simple test which writes data out, and I set it going
> > > using this loop, which generates slurm submission scripts for me and
> > > passes a ton of options to my test. The scripts run jobs on all node
> > > counts and procs-per-node counts from 1 to 64. Since the machine is not
> > > yet in production, I can get a lot of this done now.
> > >
> > > for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096
> > > do
> > >   for NPERNODE in 1 2 4 8 16 32 64
> > >   do
> > >     write_script (...options)
> > >   done
> > > done
> > >
> > > cmake - yes, I'm also compiling with clang; I'm not trying to make
> > > anything easy for myself here :)
> > >
> > > JB
> > >
> > > > -----Original Message-----
> > > > From: Hdf-forum [mailto:[email protected]] On
> > > > Behalf Of Rob Latham
> > > > Sent: 20 September 2013 17:03
> > > > To: HDF Users Discussion List
> > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> > > > single (shared) file
> > > >
> > > > On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:
> > > > > This morning, I did some poking around and found that the cmake
> > > > > based configure of hdf5 has a nasty bug that causes H5_HAVE_GPFS to
> > > > > be set to false and no GPFS optimizations are compiled in (libgpfs
> > > > > is not detected). Having tweaked that, you can imagine my happiness
> > > > > when I recompiled everything and now I'm getting even worse
> > > > > bandwidth.
> > > >
> > > > Thanks for the report on those hints. HDF5 contains, outside of
> > > > gpfs-specific benchmarks, one of the few implementations of all the
> > > > gpfs_fcntl() tuning parameters. Given your experience, it's probably
> > > > best to turn off those hints.
> > > >
> > > > Also, cmake works on Blue Gene? Wow. Don't forget that Blue Gene
> > > > requires cross compilation.
> > > >
> > > > > In fact if I enable collective IO, the app coredumps on me, so the
> > > > > situation is worse than I had feared. I suspect I'm using too much
> > > > > memory in my test, and collectives are pushing me over the limit.
> > > > > The only test I can run with collective enabled is the one that
> > > > > uses only one rank and writes 16MB!
> > > >
> > > > How many processes per node are you using on your BG/Q? If you are
> > > > loading up with 64 procs per node, that will give each one about
> > > > 200-230 MiB of scratch space.
> > > >
> > > > I wonder if you have built some or all of your hdf5 library for the
> > > > front-end nodes, and some or none for the compute nodes?
> > > >
> > > > How many processes are you running here?
> > > >
> > > > A month back I ran some one-rack experiments:
> > > > https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compare.png
> > > >
> > > > Here's my IOR config file. Note two tuning parameters here:
> > > > - "bg_nodes_pset", which showed up on Blue Gene/L, is way way too low
> > > >   for Blue Gene/Q
> > > > - the 'bglockless' prefix is "robl's secret turbo button". It was fun
> > > >   to pull that rabbit out of the hat... for the first few years.
> > > >   (It's not the default because in one specific case performance is
> > > >   shockingly poor.)
> > > >
> > > > IOR START
> > > > numTasks=65536
> > > > repetitions=3
> > > > reorderTasksConstant=1024
> > > > fsync=1
> > > > transferSize=6M
> > > > blockSize=6M
> > > > collective=1
> > > > showHints=1
> > > > hintsFileName=IOR-hints-bg_nodes_pset.64
> > > >
> > > > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.mpi
> > > > api=MPIIO
> > > > RUN
> > > > api=HDF5
> > > > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.h5
> > > > RUN
> > > > api=NCMPI
> > > > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.nc
> > > > RUN
> > > > IOR STOP
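(A side note on those hints: the same keys that go into an IOR hintsFileName
can be set programmatically and passed to HDF5 through an MPI_Info object on
the file-access property list. A rough sketch; the hint names and values here
are illustrative only, and whether a given hint is honored depends on the
MPI-IO implementation.)

  #include <hdf5.h>
  #include <mpi.h>

  /* Sketch: pass MPI-IO hints to HDF5 through the file-access property
     list.  Hint names/values below are illustrative only. */
  hid_t create_with_hints(const char *filename, MPI_Comm comm)
  {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "bg_nodes_pset", "64");      /* illustrative value   */
      MPI_Info_set(info, "romio_cb_write", "enable"); /* collective buffering */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);             /* MPI-IO file driver   */

      hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      MPI_Info_free(&info);
      return file;
  }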
> > > >
> > > > > Rob: you mentioned some fcntl functions were deprecated etc. Do I
> > > > > need to remove these to stop the coredumps? (I'm very much hoping
> > > > > something has gone wrong with my tests, because the performance is
> > > > > shockingly bad...) (NB: my version is 1.8.12-snap17)
> > > >
> > > > Unless you are running BGQ system software driver V1R2M1, the
> > > > gpfs_fcntl hints do not get forwarded to storage, and return an error.
> > > > It's possible HDF5 responds to that error with a core dump?
> > > >
> > > > ==rob
> > > >
> > > > > JB
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hdf-forum [mailto:[email protected]] On
> > > > > > Behalf Of Daniel Langr
> > > > > > Sent: 20 September 2013 13:46
> > > > > > To: HDF Users Discussion List
> > > > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> > > > > > single (shared) file
> > > > > >
> > > > > > Rob,
> > > > > >
> > > > > > thanks a lot for the hints. I will look at the suggested option and
> > > > > > try some experiments with it :).
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > > On 17. 9. 2013 15:34, Rob Latham wrote:
> > > > > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > > > > >> separate files:             1.36 [s]
> > > > > > >> single file, 1 stripe:    133.6  [s]
> > > > > > >> single file, best result:  17.2  [s]
> > > > > > >>
> > > > > > >> (I did multiple runs with various combinations of stripe count
> > > > > > >> and size, presenting the best results I have obtained.)
> > > > > > >>
> > > > > > >> Increasing the number of stripes obviously helped a lot, but
> > > > > > >> compared with the separate-files strategy, the writing time is
> > > > > > >> still more than ten times slower. Do you think it is "normal"?
> > > > > > >
> > > > > > > It might be "normal" for Lustre, but it's not good. I wish I
> > > > > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I
> > > > > > > do not. The ADIOS folks report that tuned HDF5 to a single
> > > > > > > shared file runs about 60% slower than ADIOS to multiple files,
> > > > > > > not 10x slower, so it seems there is room for improvement.
> > > > > > >
> > > > > > > I've asked them about the kinds of things "tuned HDF5" entails,
> > > > > > > and they didn't know (!).
> > > > > > >
> > > > > > > There are quite a few settings documented in the intro_mpi(3)
> > > > > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most
> > > > > > > important thing you can try. I'm sorry to report that in my
> > > > > > > limited experience, the documentation and reality are sometimes
> > > > > > > out of sync, especially with respect to which settings are
> > > > > > > default or not.
> > > > > > >
> > > > > > > ==rob
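(Also on the Lustre side: stripe settings for a brand-new shared file can be
requested through MPI-IO hints at creation time instead of with lfs setstripe
on the directory. Another rough sketch, along the same lines as the one
above; the values are illustrative and support varies by MPI-IO
implementation.)

  #include <hdf5.h>
  #include <mpi.h>

  /* Sketch: request Lustre striping for a newly created shared file.
     Striping hints only take effect at file-creation time. */
  hid_t create_striped(const char *filename, MPI_Comm comm)
  {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "16");    /* number of OSTs, illustrative */
      MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripe size            */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);

      hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      MPI_Info_free(&info);
      return file;
  }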
> > > > > > >
> > > > > > >> Thanks, Daniel
> > > > > > >>
> > > > > > >> On 30. 8. 2013 16:05, Daniel Langr wrote:
> > > > > > >>> I've run some benchmarks where, within an MPI program, each
> > > > > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.
> > > > > > >>> I've used the following writing strategies:
> > > > > > >>>
> > > > > > >>> 1) each process writes to its own file,
> > > > > > >>> 2) each process writes to the same file to its own dataset,
> > > > > > >>> 3) each process writes to the same file to a shared dataset.
> > > > > > >>>
> > > > > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size
> > > > > > >>> 1024), and I've tested 2)-3) for both independent/collective
> > > > > > >>> options of the MPI driver. I've also used 3 different clusters
> > > > > > >>> for the measurements (all quite modern).
> > > > > > >>>
> > > > > > >>> As a result, the running (storage) times of the same-file
> > > > > > >>> strategies, i.e. 2) and 3), were orders of magnitude longer
> > > > > > >>> than the running times of the separate-files strategy. For
> > > > > > >>> illustration:
> > > > > > >>>
> > > > > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of
> > > > > > >>> data, fixed data sets:
> > > > > > >>>
> > > > > > >>> 1) separate files: 2.73 [s]
> > > > > > >>> 2) single file, independent calls, separate data sets: 88.54 [s]
> > > > > > >>>
> > > > > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of
> > > > > > >>> data, chunked data sets (chunk size 1024):
> > > > > > >>>
> > > > > > >>> 1) separate files: 10.40 [s]
> > > > > > >>> 2) single file, independent calls, shared data sets: 295 [s]
> > > > > > >>> 3) single file, collective calls, shared data sets: 3275 [s]
> > > > > > >>>
> > > > > > >>> Any idea why the single-file strategy gives such poor writing
> > > > > > >>> performance?
> > > > > > >>>
> > > > > > >>> Daniel
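(Since independent vs. collective transfers keep coming up in these numbers:
in HDF5 that choice is made per write through the dataset-transfer property
list. A minimal sketch, assuming the dataset, dataspaces, and buffer already
exist:)

  #include <hdf5.h>

  /* Sketch: switch a dataset write between independent and collective
     MPI-IO, which is the difference between strategies 2) and 3) above. */
  herr_t write_block(hid_t dset, hid_t memspace, hid_t filespace,
                     const double *buf, int collective)
  {
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                        : H5FD_MPIO_INDEPENDENT);

      herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE,
                               memspace, filespace, dxpl, buf);
      H5Pclose(dxpl);
      return status;
  }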
> > > > > [email protected]<mailto:[email protected]> > > > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hd > > > > > > fg > > > > > > roup.org > > > > > > > > > > _______________________________________________ Hdf-forum > > > is > > > > for HDF > > > > > software users discussion. > > > > [email protected]<mailto:[email protected]> > > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfg > > > > > ro > > > > > up.org > > > > > > > > -- > > > > Rob Latham > > > > Mathematics and Computer Science Division Argonne National Lab, IL USA > > > > > > > > _______________________________________________ > > > > Hdf-forum is for HDF software users discussion. > > > > [email protected]<mailto:[email protected]> > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgro > > > > up.org > > > > > > _______________________________________________ > > > Hdf-forum is for HDF software users discussion. > > > [email protected]<mailto:[email protected]> > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > _______________________________________________ > Hdf-forum is for HDF software users discussion. > [email protected] > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA _______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
