Hi Nikhil,

I am in the NERSC Analytics group and have done extensive benchmarking and testing of I/O on Franklin. We have been working in collaboration with The HDF Group for almost a year now to improve parallel I/O performance on Lustre file systems, with Franklin as one of our primary test machines.
The root-only write scenario you describe will always lead to serialization, because you have only one compute node communicating with the I/O targets (the object storage targets, or "OSTs," in Lustre terminology). In your parallel scenario (which is called "independent" parallel I/O, as opposed to "collective" parallel I/O, which I will describe in a bit), you are probably experiencing serialization because you are using the default Lustre striping on Franklin, which is only 2 stripes wide. This means that all 200 of your processors are communicating with only 2 of the 48 available OSTs. You can find out more about Lustre striping on this page: http://www.nersc.gov/nusers/systems/franklin/io.php

If you increase the stripe count with stripe_large myOutputDir/ (which sets the striping on the directory and on any new files created in it) or stripe_medium specificFile.h5 (which touches the file before your program runs, and must be repeated for each output file), you will use all 48 OSTs and should see improved performance in parallel mode. From your plot, it looks like you are getting around 500-1100 MB/s of write bandwidth out of the ~12 GB/s peak available on Franklin.

A further optimization that may help is to enable "collective" mode, which creates a one-to-one mapping between a subset of your processors and the OSTs and involves a communication step similar to the one you implemented for the root-only scenario: the remaining processors send their data to that subset, and the subset writes the data to disk. This is called "two-phase I/O" or "collective buffering," and the extra coordination it provides can improve performance for many I/O patterns.

You can find more details, including code snippets showing how to set this up in HDF5, in the NERSC parallel I/O tutorial: http://www.nersc.gov/nusers/help/tutorials/io/ The tutorial also summarizes some of the improvements we have been working on, which will soon be rolled into the public release of the HDF5 library.
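In case it is useful as a starting point, here is a minimal sketch (C, HDF5 1.8 API) of the two property lists involved: a file access property list that opens the file through MPI-IO, and a dataset transfer property list that switches the write from independent to collective mode. The file name, dataset name, and sizes below are placeholders for whatever your code uses, and the MPI_Info striping hint is optional; whether it is honored depends on the MPI-IO implementation, so setting the striping on the output directory with stripe_large remains the more reliable route.

/* Sketch: collective parallel write of a 1-D double dataset.
   File/dataset names and sizes are placeholders. */
#include <mpi.h>
#include <hdf5.h>

void write_collective(MPI_Comm comm, const double *buf, hsize_t local_n)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Optional MPI-IO hint; support is implementation dependent. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");  /* assumes 48 OSTs */

    /* File access property list: open the file through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);
    hid_t file = H5Fcreate("specificFile.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One contiguous block per process in a single shared 1-D dataset. */
    hsize_t total_n = local_n * (hsize_t)nprocs;
    hsize_t offset  = local_n * (hsize_t)rank;
    hid_t filespace = H5Screate_simple(1, &total_n, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                        &local_n, NULL);
    hid_t memspace = H5Screate_simple(1, &local_n, NULL);

    /* Dataset transfer property list: this is the one call that turns on
       collective (two-phase) I/O instead of independent I/O. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
}

Note that opening the file with H5Pset_fapl_mpio alone gives you the independent parallel I/O you already have; it is the H5FD_MPIO_COLLECTIVE setting on the transfer property list that enables the two-phase aggregation described above. If you are calling HDF5 from Fortran, the equivalent calls (h5pset_fapl_mpio_f and h5pset_dxpl_mpio_f) follow the same pattern.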
Let me know if you have more questions or want to continue this discussion offline. I would be glad to talk with you further, or to help you modify your code or run more I/O tests.

Mark

On Thu, Jul 22, 2010 at 12:42 AM, Nikhil Laghave <[email protected]> wrote:
> Hello All,
>
> I have a generic question regarding the comparison of seq. binary and
> parallel HDF5 for I/O of large files.
>
> I am using the Franklin supercomputer at NERSC for my experiments. The
> dataset/file sizes are between 55GB and 111GB, and in the seq. binary case
> they are written by a single processor: several (~200) processors send
> their data to a single root processor, which does the I/O to disk. So
> basically only 1 processor is doing the I/O to disk.
>
> In the case of parallel HDF5, all of the ~200 processors do the I/O to
> disk independently, without communication with the root processor.
>
> However, on the Lustre file system, there are file locks leading to all of
> the ~200 write operations being serialized in actuality.
>
> Now when I compare the performance of seq. binary vs. parallel HDF5, the
> only difference is that in the case of seq. binary there is communication
> overhead, which according to my measurements is not a big overhead. In
> that case, since both of the writes (seq. binary & parallel HDF5) are
> sequential/serialized, I expected the performance to be similar. However,
> in my experiments parallel HDF5 outperforms seq. binary significantly. I
> do not understand why this is so, since even the parallel HDF5 write
> operations are serialized. The attached plot illustrates my doubt.
>
> Please can someone explain to me why parallel HDF5 outperforms seq. binary
> writes even though the parallel HDF5 writes are also serialized? Your
> inputs are greatly appreciated. Thank you.
>
> Nikhil
>
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
