After a lot of testing, my understanding is that:

1. If the data belonging to each MPI process must be interleaved in the file, then P-HDF5 (and MPI-IO) can significantly reduce the elapsed time spent on I/O.
2. If not (independent data written independently by each MPI process), then the P-HDF5, MPI-IO, and sequential approaches are equivalent.
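To make the two cases concrete, here is a minimal sketch in plain C++ (no MPI or HDF5 calls) of the byte ranges one process would touch in a shared file. The `Extent` type, the function names, and the round-robin record layout are illustrative assumptions, not taken from the test code discussed in this thread. With one block per process, each process issues a single contiguous request. With interleaved records, each process issues one small request per record, which is the access pattern that collective I/O is designed to aggregate.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One contiguous byte range in the shared file (hypothetical helper type).
struct Extent {
    std::uint64_t offset;  // first byte of the range
    std::uint64_t length;  // number of bytes in the range
};

// Case 2: each process owns one contiguous block of the file.
// The result always holds exactly one extent.
std::vector<Extent> block_layout(int rank,
                                 std::uint64_t records_per_rank,
                                 std::uint64_t record_bytes) {
    assert(rank >= 0);
    const std::uint64_t block_bytes = records_per_rank * record_bytes;
    return {Extent{static_cast<std::uint64_t>(rank) * block_bytes, block_bytes}};
}

// Case 1: records from all processes are interleaved round-robin
// (assumed layout: record i of process r is record number i * nprocs + r
// in the file). The result holds one small extent per record.
std::vector<Extent> interleaved_layout(int rank, int nprocs,
                                       std::uint64_t records_per_rank,
                                       std::uint64_t record_bytes) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    std::vector<Extent> extents;
    extents.reserve(records_per_rank);
    for (std::uint64_t i = 0; i < records_per_rank; ++i) {
        const std::uint64_t record_index =
            i * static_cast<std::uint64_t>(nprocs) + static_cast<std::uint64_t>(rank);
        extents.push_back(Extent{record_index * record_bytes, record_bytes});
    }
    return extents;
}
```

For example, `block_layout(1, 4, 8)` describes a single 32-byte range, while `interleaved_layout(1, 2, 4, 8)` describes four separate 8-byte ranges. The number of separate requests per process is what distinguishes the two cases.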
In hindsight, this seems logical to me. Are there other situations where HDF5 may speed up I/O (reduce elapsed time)?

Franck

On 2014-08-08 17:26, Rob Latham wrote:
> On 08/08/2014 03:27 AM, houssen wrote:
>> In short : are there things to know / make sure of / be aware of to get
>> good performance with P-HDF5 ?
>
> - turn on collective I/O. it's not enabled by default
>
> - HDF5 metadata might be a factor if you have very many small
>   datasets, but for most applications it's not important
>
> - consult your MPI library for any file-system specific tuning you
>   might be able to do. For example, Intel-MPI needs you to set an
>   environment variable before it will use any of the GPFS or Panasas
>   optimizations it has written.
>
> - be mindful of type conversions: if your data in memory is a 4-byte
>   float, but they are 8-byte doubles on disk, HDF5 will "break
>   collective" and do that I/O independently.
>
>
>> To test this I wrote a MPI code. ... I expected to get better
>> performance with MPI-IO and P-HDF5 than with the sequential approach.
>> The spirit of this test code is very simple / basic (each MPI process
>> writes his own block of data in the same file, or, in separate files in
>> the sequential approach).
>
>> Note : in each case (sequential, MPI-IO, P-HDF5), when I say "write data
>> in file", I mean writing big blocks / bunch of data at once (I do not
>> write data one by one - I write the biggest block of data, but smaller
>> than 2Gb, that is possible to write).
>> Note : I tried with N = 1, 2, 4, 8, 16.
>
> in 2014, 16 is not very parallel. serial I/O has many benefits at
> modest levels of parallelism: caching, mostly.
>
>> Note : I generated files (MPI-IO, P-HDF5) whose size scaled from 1Gb to
>> 16 Gb (which looks like a "very big" file to me).
>
> that's adequate, yes
>
>> Note : I followed the P-HDF5 documentation (use H5P_FILE_ACCESS and
>> H5P_DATASET_XFER property list + use hyperslab "by chunks")
>> Note : the file system is "GPFS" (it has been installed by the cluster
>> vendor : this is supposed to be ready to get performance out of P-HDF5 -
>> I am an "application" guy that try to use HDF5, I am not an "admin sys"
>> that would be familiar with complex related stuffs related to the file
>> system)
>
> Now we are getting somewhere.
>
>> Note : I compiled the HDF5 package like this "./configure
>> --enable-parallel".
>> Note : I use CentOS + GNU compilers (for both HDF5 package and my test
>> code) + hdf5-1.8.13
>> Note : I use mpic++ (not h5pxx compilers - actually I didn't get why
>> HDF5 provides compilers) to compile my test code, is this a problem ?
>
> just makes it easier to pick up any libraries needed. I don't use
> the wrappers, either, which means sometimes I need to figure out what
> new library (like -ldl) HDF5 needs.
>
>> Any relevant clue / information would be appreciated. If what I observe
>> is logical I would just understand why, and, how / when it is possible
>> to get performance out of P-HDF5. I just would like to get some logic
>> out of this.
>
> If you are using GPFS, there is one optimization that goes a long way
> towards improving performance: aligning writes to file system block
> boundaries. See this email from a few weeks ago:
>
> http://mail.lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2014-July/007963.html
>
> ==rob
>
>>
>> Thanks for help,
>>
>> FH
>>
>> PS : I can give more information and the code, if needed (?)
>>
>>
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> [email protected]
>> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
>> Twitter: https://twitter.com/hdf5
>>
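On the block-boundary alignment mentioned in the quoted reply, the arithmetic can be sketched in plain C++ as follows. The function names are hypothetical, and the block size must come from the actual file system configuration; none of this is taken from the linked message. In HDF5 itself, alignment of file objects is normally requested with `H5Pset_alignment` on the file access property list. Whether that is exactly what the linked message recommends is not confirmed here.

```cpp
#include <cassert>
#include <cstdint>

// Round `offset` up to the next multiple of `block_bytes`.
// An offset that already sits on a boundary is returned unchanged.
std::uint64_t align_up(std::uint64_t offset, std::uint64_t block_bytes) {
    assert(block_bytes > 0);
    const std::uint64_t remainder = offset % block_bytes;
    return remainder == 0 ? offset : offset + (block_bytes - remainder);
}

// Start offset of the region owned by `rank` when every region is padded
// to a whole number of file system blocks, so that no two processes ever
// write into the same block.
std::uint64_t aligned_rank_offset(int rank,
                                  std::uint64_t bytes_per_rank,
                                  std::uint64_t block_bytes) {
    assert(rank >= 0);
    return static_cast<std::uint64_t>(rank) * align_up(bytes_per_rank, block_bytes);
}
```

With an assumed block size of 4096 bytes and 5000 bytes per process, each region is padded to 8192 bytes, so process 2 starts at byte 16384. The cost is some wasted space in the file. The benefit is that processes do not contend for the same file system block.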
