On Fri, Sep 27, 2013 at 02:58:32PM +0000, Biddiscombe, John A. wrote:
> The difference that one flag can make is quite impressive. People need to
> know this!

Oh John, oh John... I cannot tell you how angry that flag makes me!

'bglockless:' was supposed to be a short-term hack. It was written for the
PVFS file system (which did not support fcntl()-style locks, or any locks at
all for that matter). Then we found out it helped GPFS on Blue Gene too. I'm
going to have to just sit down for a couple-five days, send IBM a patch
removing all the locks from the default driver, and tell anyone who wants to
run MPI-IO to an NFS file system on a Blue Gene to take a hike.
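For readers who haven't met the flag before: 'bglockless:' is not an HDF5 or
MPI option, just a ROMIO file-system prefix prepended to the file name
(exactly as in the testFile= lines of the IOR config quoted further down).
A minimal sketch of plain MPI-IO use, with a made-up path:

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Init(&argc, &argv);

      /* ROMIO chooses its file-system driver from the prefix; "bglockless:"
         skips the fcntl() locking done by the default GPFS/UFS driver.
         The path below is made up for illustration. */
      char path[] = "bglockless:/gpfs/scratch/example/output.dat";

      MPI_File_open(MPI_COMM_WORLD, path,
                    MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);
      /* ... MPI_File_write_at_all(), etc. ... */
      MPI_File_close(&fh);

      MPI_Finalize();
      return 0;
  }

The same prefixed name can be handed to HDF5 when the MPI-IO file driver is
in use; that is how the api=HDF5 run in the IOR config below picks it up.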
Thanks for the graphs. I was surprised to see that fewer than 8 cores per
node resulted in slightly *worse* performance for collective I/O.

==rob

> [inline image attachment]
>
> > > -----Original Message-----
> > > From: Hdf-forum [mailto:[email protected]] On Behalf
> > > Of Biddiscombe, John A.
> > > Sent: 20 September 2013 21:47
> > > To: HDF Users Discussion List
> > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using single
> > > (shared) file
> > >
> > > Rob
> > >
> > > Thanks for the info regarding settings, IOR config, etc. I will go
> > > through that in detail over the next few days.
> > >
> > > I plan on taking a crash course in debugging on BG/Q ASAP; my skills in
> > > this regard are little better than printf, and I'm going to need to do
> > > some profiling and stepping through code to see what's going on inside
> > > HDF5.
> > >
> > > Just FYI, I run a simple test which writes data out, and I set it going
> > > using this loop, which generates slurm submission scripts for me and
> > > passes a ton of options to my test. The scripts run jobs on all node
> > > counts and procs-per-node counts from 1 to 64. Since the machine is not
> > > yet in production, I can get a lot of this done now.
> > >
> > > for NODES in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096
> > > do
> > >   for NPERNODE in 1 2 4 8 16 32 64
> > >   do
> > >     write_script (...options)
> > >   done
> > > done
> > >
> > > cmake - yes, I'm also compiling with clang; I'm not trying to make
> > > anything easy for myself here :)
> > >
> > > JB
> > >
> > > > -----Original Message-----
> > > > From: Hdf-forum [mailto:[email protected]] On
> > > > Behalf Of Rob Latham
> > > > Sent: 20 September 2013 17:03
> > > > To: HDF Users Discussion List
> > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> > > > single (shared) file
> > > >
> > > > On Fri, Sep 20, 2013 at 01:34:24PM +0000, Biddiscombe, John A. wrote:
> > > > > This morning, I did some poking around and found that the cmake
> > > > > based configure of hdf5 has a nasty bug that causes H5_HAVE_GPFS to
> > > > > be set to false and no GPFS optimizations are compiled in (libgpfs
> > > > > is not detected). Having tweaked that, you can imagine my happiness
> > > > > when I recompiled everything and now I'm getting even worse
> > > > > bandwidth.
> > > >
> > > > Thanks for the report on those hints. HDF5 contains, outside of
> > > > gpfs-specific benchmarks, one of the few implementations of all the
> > > > gpfs_fcntl() tuning parameters. Given your experience, it's probably
> > > > best to turn off those hints.
> > > >
> > > > Also, cmake works on Blue Gene? Wow. Don't forget that Blue Gene
> > > > requires cross compilation.
> > > >
> > > > > In fact if I enable collective IO, the app coredumps on me, so the
> > > > > situation is worse than I had feared. I suspect I'm using too much
> > > > > memory in my test, and collectives are pushing me over the limit.
> > > > > The only test I can run with collective enabled is the one that
> > > > > uses only one rank and writes 16MB!
> > > >
> > > > How many processes per node are you using on your BG/Q? If you are
> > > > loading up with 64 procs per node, that will give each one about
> > > > 200-230 MiB of scratch space.
> > > >
> > > > I wonder if you have built some or all of your hdf5 library for the
> > > > front-end nodes, and some or none for the compute nodes?
> > > >
> > > > How many processes are you running here?
> > > >
> > > > A month back I ran some one-rack experiments:
> > > > https://www.dropbox.com/s/89wmgmf1b1ung0s/mira_hinted_api_compare.png
> > > >
> > > > Here's my IOR config file. Note two tuning parameters here:
> > > > - "bg_nodes_pset", which showed up on Blue Gene/L, is way way too low
> > > >   for Blue Gene/Q
> > > > - the 'bglockless' prefix is "robl's secret turbo button". It was fun
> > > >   to pull that rabbit out of the hat... for the first few years.
> > > >   (It's not the default because in one specific case performance is
> > > >   shockingly poor.)
> > > >
> > > > IOR START
> > > > numTasks=65536
> > > > repetitions=3
> > > > reorderTasksConstant=1024
> > > > fsync=1
> > > > transferSize=6M
> > > > blockSize=6M
> > > > collective=1
> > > > showHints=1
> > > > hintsFileName=IOR-hints-bg_nodes_pset.64
> > > >
> > > > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.mpi
> > > > api=MPIIO
> > > > RUN
> > > > api=HDF5
> > > > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.h5
> > > > RUN
> > > > api=NCMPI
> > > > testFile=bglockless:/gpfs/mira-fs0/projects/SSSPPg/robl/ior-shared/io-api.nc
> > > > RUN
> > > > IOR STOP
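(A side note on those hints: the same keys that go into an IOR hintsFileName
can be set programmatically and passed to HDF5 through an MPI_Info object on
the file-access property list. A rough sketch; the hint names and values here
are illustrative only, and whether a given hint is honored depends on the
MPI-IO implementation.)

  #include <hdf5.h>
  #include <mpi.h>

  /* Sketch: pass MPI-IO hints to HDF5 through the file-access property
     list.  Hint names/values below are illustrative only. */
  hid_t create_with_hints(const char *filename, MPI_Comm comm)
  {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "bg_nodes_pset", "64");      /* illustrative value   */
      MPI_Info_set(info, "romio_cb_write", "enable"); /* collective buffering */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);             /* MPI-IO file driver   */

      hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      MPI_Info_free(&info);
      return file;
  }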
> > > >
> > > > > Rob: you mentioned some fcntl functions were deprecated etc. Do I
> > > > > need to remove these to stop the coredumps? (I'm very much hoping
> > > > > something has gone wrong with my tests, because the performance is
> > > > > shockingly bad...) (NB: my version is 1.8.12-snap17)
> > > >
> > > > Unless you are running BGQ system software driver V1R2M1, the
> > > > gpfs_fcntl hints do not get forwarded to storage, and return an error.
> > > > It's possible HDF5 responds to that error with a core dump?
> > > >
> > > > ==rob
> > > >
> > > > > JB
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hdf-forum [mailto:[email protected]] On
> > > > > > Behalf Of Daniel Langr
> > > > > > Sent: 20 September 2013 13:46
> > > > > > To: HDF Users Discussion List
> > > > > > Subject: Re: [Hdf-forum] Very poor performance of pHDF5 when using
> > > > > > single (shared) file
> > > > > >
> > > > > > Rob,
> > > > > >
> > > > > > thanks a lot for the hints. I will look at the suggested option and
> > > > > > try some experiments with it :).
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > > On 17. 9. 2013 15:34, Rob Latham wrote:
> > > > > > > On Tue, Sep 17, 2013 at 11:15:02AM +0200, Daniel Langr wrote:
> > > > > > >> separate files:             1.36 [s]
> > > > > > >> single file, 1 stripe:    133.6  [s]
> > > > > > >> single file, best result:  17.2  [s]
> > > > > > >>
> > > > > > >> (I did multiple runs with various combinations of stripe count
> > > > > > >> and size, presenting the best results I have obtained.)
> > > > > > >>
> > > > > > >> Increasing the number of stripes obviously helped a lot, but
> > > > > > >> compared with the separate-files strategy, the writing time is
> > > > > > >> still more than ten times slower. Do you think it is "normal"?
> > > > > > >
> > > > > > > It might be "normal" for Lustre, but it's not good. I wish I
> > > > > > > had more experience tuning the Cray/MPI-IO/Lustre stack, but I
> > > > > > > do not. The ADIOS folks report that tuned HDF5 to a single
> > > > > > > shared file runs about 60% slower than ADIOS to multiple files,
> > > > > > > not 10x slower, so it seems there is room for improvement.
> > > > > > >
> > > > > > > I've asked them about the kinds of things "tuned HDF5" entails,
> > > > > > > and they didn't know (!).
> > > > > > >
> > > > > > > There are quite a few settings documented in the intro_mpi(3)
> > > > > > > man page. MPICH_MPIIO_CB_ALIGN will probably be the most
> > > > > > > important thing you can try. I'm sorry to report that in my
> > > > > > > limited experience, the documentation and reality are sometimes
> > > > > > > out of sync, especially with respect to which settings are
> > > > > > > default or not.
> > > > > > >
> > > > > > > ==rob
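(Also on the Lustre side: stripe settings for a brand-new shared file can be
requested through MPI-IO hints at creation time instead of with lfs setstripe
on the directory. Another rough sketch, along the same lines as the one
above; the values are illustrative and support varies by MPI-IO
implementation.)

  #include <hdf5.h>
  #include <mpi.h>

  /* Sketch: request Lustre striping for a newly created shared file.
     Striping hints only take effect at file-creation time. */
  hid_t create_striped(const char *filename, MPI_Comm comm)
  {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "16");    /* number of OSTs, illustrative */
      MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripe size            */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);

      hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      MPI_Info_free(&info);
      return file;
  }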
> > > > > > >
> > > > > > >> Thanks, Daniel
> > > > > > >>
> > > > > > >> On 30. 8. 2013 16:05, Daniel Langr wrote:
> > > > > > >>> I've run some benchmarks where, within an MPI program, each
> > > > > > >>> process wrote 3 plain 1D arrays to 3 datasets of an HDF5 file.
> > > > > > >>> I've used the following writing strategies:
> > > > > > >>>
> > > > > > >>> 1) each process writes to its own file,
> > > > > > >>> 2) each process writes to the same file to its own dataset,
> > > > > > >>> 3) each process writes to the same file to a shared dataset.
> > > > > > >>>
> > > > > > >>> I've tested 1)-3) for both fixed/chunked datasets (chunk size
> > > > > > >>> 1024), and I've tested 2)-3) for both independent/collective
> > > > > > >>> options of the MPI driver. I've also used 3 different clusters
> > > > > > >>> for the measurements (all quite modern).
> > > > > > >>>
> > > > > > >>> As a result, the running (storage) times of the same-file
> > > > > > >>> strategies, i.e. 2) and 3), were orders of magnitude longer
> > > > > > >>> than the running times of the separate-files strategy. For
> > > > > > >>> illustration:
> > > > > > >>>
> > > > > > >>> cluster #1, 512 MPI processes, each process stores 100 MB of
> > > > > > >>> data, fixed data sets:
> > > > > > >>>
> > > > > > >>> 1) separate files: 2.73 [s]
> > > > > > >>> 2) single file, independent calls, separate data sets: 88.54 [s]
> > > > > > >>>
> > > > > > >>> cluster #2, 256 MPI processes, each process stores 100 MB of
> > > > > > >>> data, chunked data sets (chunk size 1024):
> > > > > > >>>
> > > > > > >>> 1) separate files: 10.40 [s]
> > > > > > >>> 2) single file, independent calls, shared data sets: 295 [s]
> > > > > > >>> 3) single file, collective calls, shared data sets: 3275 [s]
> > > > > > >>>
> > > > > > >>> Any idea why the single-file strategy gives such poor writing
> > > > > > >>> performance?
> > > > > > >>>
> > > > > > >>> Daniel
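(Since independent vs. collective transfers keep coming up in these numbers:
in HDF5 that choice is made per write through the dataset-transfer property
list. A minimal sketch, assuming the dataset, dataspaces, and buffer already
exist:)

  #include <hdf5.h>

  /* Sketch: switch a dataset write between independent and collective
     MPI-IO, which is the difference between strategies 2) and 3) above. */
  herr_t write_block(hid_t dset, hid_t memspace, hid_t filespace,
                     const double *buf, int collective)
  {
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                        : H5FD_MPIO_INDEPENDENT);

      herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE,
                               memspace, filespace, dxpl, buf);
      H5Pclose(dxpl);
      return status;
  }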
> > > > > [email protected]<mailto:[email protected]> > > > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hd > > > > > > fg > > > > > > roup.org > > > > > > > > > > _______________________________________________ Hdf-forum > > > is > > > > for HDF > > > > > software users discussion. > > > > [email protected]<mailto:[email protected]> > > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfg > > > > > ro > > > > > up.org > > > > > > > > -- > > > > Rob Latham > > > > Mathematics and Computer Science Division Argonne National Lab, IL USA > > > > > > > > _______________________________________________ > > > > Hdf-forum is for HDF software users discussion. > > > > [email protected]<mailto:[email protected]> > > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgro > > > > up.org > > > > > > _______________________________________________ > > > Hdf-forum is for HDF software users discussion. > > > [email protected]<mailto:[email protected]> > > > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > _______________________________________________ > Hdf-forum is for HDF software users discussion. > [email protected] > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA _______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
