Just another voice to join the choir... Our BG/Q with GPFS was switched on last week, and my HDF5 performance has been around 10% of IOR, compared with roughly 60% of IOR on the Cray running Lustre.

This morning I did some poking around and found that the CMake-based configure of HDF5 has a nasty bug that causes H5_HAVE_GPFS to be set to false, so no GPFS optimizations are compiled in (libgpfs is not detected).
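In case anyone else wants to check whether their build is affected, the quickest test I know of is to compile a trivial program against the installed library and see whether H5_HAVE_GPFS is defined (it comes in via H5pubconf.h, which hdf5.h includes). Something along these lines; just a sketch, and the messages are obviously mine:

/* Sanity check: does the installed HDF5 advertise GPFS support?
 * Build with h5pcc (or mpicc plus the HDF5 include/lib flags).
 */
#include <stdio.h>
#include <hdf5.h>   /* pulls in H5pubconf.h, where H5_HAVE_GPFS lives */

int main(void)
{
    unsigned maj, min, rel;
    H5get_libversion(&maj, &min, &rel);
    printf("HDF5 library version: %u.%u.%u\n", maj, min, rel);

#ifdef H5_HAVE_GPFS
    printf("H5_HAVE_GPFS is defined: GPFS optimizations compiled in\n");
#else
    printf("H5_HAVE_GPFS is NOT defined: GPFS optimizations were skipped\n");
#endif
    return 0;
}

Note that this only tells you what the installed headers say, so point it at the same install you actually link against.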
Having tweaked that, you can imagine my happiness when I recompiled everything and am now getting even worse bandwidth. In fact, if I enable collective I/O the application core dumps on me, so the situation is worse than I had feared. I suspect I'm using too much memory in my test and the collectives are pushing me over the limit; the only test I can run with collective I/O enabled is the one that uses a single rank and writes 16 MB! It looks like I'm going to have to spend quite a bit more time looking at this.

If anyone else is making tweaks to the HDF5 source, please let me know, as I don't want to duplicate what anyone else is doing, but I'll be happy to help track down issues. Rob: you mentioned some fcntl functions were deprecated etc. Do I need to remove these to stop the core dumps?

(I'm very much hoping something has gone wrong with my tests, because the performance is shockingly bad...)
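In case it helps anyone spot what I'm doing wrong, the failing path in my test is essentially the textbook shared-file pattern below. This is a stripped-down sketch rather than the actual benchmark (the dataset name, sizes, and the one-slab-per-rank layout are just for illustration):

/* Minimal sketch of how the test switches between independent and
 * collective I/O; not the real benchmark, just the relevant calls.
 */
#include <mpi.h>
#include <hdf5.h>

void write_shared(const char *filename, const double *buf, hsize_t nlocal,
                  MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* File access property list: open one shared file with MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One shared 1-D dataset, each rank owning a contiguous slab */
    hsize_t dims = (hsize_t)nprocs * nlocal;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t offset = (hsize_t)rank * nlocal;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &nlocal, NULL);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

    /* Transfer property list: this is the switch that triggers the core dumps
     * for me when set to COLLECTIVE */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* or H5FD_MPIO_INDEPENDENT */

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Pclose(fapl);
    H5Fclose(file);
}

Flip H5FD_MPIO_COLLECTIVE back to H5FD_MPIO_INDEPENDENT and the runs complete, just very slowly.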
(NB: my version is 1.8.12-snap17)

JB

> -----Original Message-----
> From: Hdf-forum [mailto:[email protected]] On Behalf Of Daniel Langr
> Sent: 20 September 2013 13:46
> To: HDF Users Discussion List
> Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single (shared) file
>
> Rob,
>
> thanks a lot for the hints. I will look at the suggested option and try some experiments with it :).
>
> Daniel
>
> On 17. 9. 2013 15:34, Rob Latham wrote:
> > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> >> separate files: 1.36 [s]
> >> single file, 1 stripe: 133.6 [s]
> >> single file, best result: 17.2 [s]
> >>
> >> (I did multiple runs with various combinations of stripe count and size, presenting the best results I have obtained.)
> >>
> >> Increasing the number of stripes obviously helped a lot, but compared with the separate-files strategy, the writing time is still more than ten times slower. Do you think that is "normal"?
> >
> > It might be "normal" for Lustre, but it's not good. I wish I had more experience tuning the Cray/MPI-IO/Lustre stack, but I do not. The ADIOS folks report that tuned HDF5 writing to a single shared file runs about 60% slower than ADIOS writing to multiple files, not 10x slower, so it seems there is room for improvement.
> >
> > I've asked them about the kinds of things "tuned HDF5" entails, and they didn't know (!).
> >
> > There are quite a few settings documented in the intro_mpi(3) man page. MPICH_MPIIO_CB_ALIGN will probably be the most important thing you can try. I'm sorry to report that in my limited experience, the documentation and reality are sometimes out of sync, especially with respect to which settings are default or not.
> >
> > ==rob
> >
> >> Thanks,
> >> Daniel
> >>
> >> On 30. 8. 2013 16:05, Daniel Langr wrote:
> >>> I've run some benchmarks where, within an MPI program, each process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file. I've used the following writing strategies:
> >>>
> >>> 1) each process writes to its own file,
> >>> 2) each process writes to the same file to its own dataset,
> >>> 3) each process writes to the same file to the same dataset.
> >>>
> >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size 1024), and I've tested 2)-3) for both the independent and collective options of the MPI driver. I've also used 3 different clusters for the measurements (all quite modern).
> >>>
> >>> As a result, the running (storage) times of the same-file strategies, i.e. 2) and 3), were orders of magnitude longer than the running times of the separate-files strategy. For illustration:
> >>>
> >>> cluster #1, 512 MPI processes, each process stores 100 MB of data, fixed data sets:
> >>>
> >>> 1) separate files: 2.73 [s]
> >>> 2) single file, independent calls, separate data sets: 88.54 [s]
> >>>
> >>> cluster #2, 256 MPI processes, each process stores 100 MB of data, chunked data sets (chunk size 1024):
> >>>
> >>> 1) separate files: 10.40 [s]
> >>> 2) single file, independent calls, shared data sets: 295 [s]
> >>> 3) single file, collective calls, shared data sets: 3275 [s]
> >>>
> >>> Any idea why the single-file strategy gives such poor writing performance?
> >>>
> >>> Daniel
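P.S. Rob, Daniel: on the Lustre side, the usual MPI-IO hints can also be fed to HDF5 through the file access property list via an MPI_Info object, which is how I plan to experiment with them. A rough sketch follows; the hint names are the generic ROMIO ones (striping_factor, striping_unit, romio_cb_write), the values are placeholders rather than recommendations, and whether the Cray MPI-IO layer honours them is exactly the documentation-versus-reality question Rob mentions above:

/* Sketch: passing MPI-IO hints to HDF5 through the file access plist.
 * Hint names are the generic ROMIO ones; the values are placeholders.
 */
#include <mpi.h>
#include <hdf5.h>

hid_t create_file_with_hints(const char *filename, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");       /* Lustre stripe count (placeholder) */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MiB stripe size (placeholder)   */
    MPI_Info_set(info, "romio_cb_write",  "enable");   /* collective buffering on writes    */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    /* Striping hints only take effect when the file is created, not when it is reopened */
    hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    MPI_Info_free(&info);
    H5Pclose(fapl);
    return file;
}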
