Hi Mark,

Thanks for your reply.
Actually, my data layout is contiguous, so there are no optimizations to be gained from collective I/O. Basically, I am writing a 1D array to disk that is distributed among many processors, with each processor holding a contiguous block of the data (a simplified sketch of this write pattern is below). I will try your striping suggestions to see how much the performance improves.

One question I had was about locks in the Lustre file system. As far as I know, Lustre takes a write lock for each writer process, which serializes the parallel write operations. Has this changed, or can I actually have truly simultaneous parallel writes to one file?
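To be concrete, the write I am describing is roughly the following sketch. It is not my actual code: the dataset and variable names are placeholders, error checking is omitted, and it assumes every rank holds the same number of elements and a parallel (MPI) build of HDF5.

/* Independent parallel write of a 1D array, one contiguous block per rank. */
#include <mpi.h>
#include <hdf5.h>

void write_1d(const char *fname, const double *local_buf,
              hsize_t local_n, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* All ranks open one shared file through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global 1D dataspace; with equal-sized blocks the offset is rank * local_n. */
    hsize_t global_n = local_n * (hsize_t)nprocs;
    hsize_t offset   = local_n * (hsize_t)rank;
    hid_t filespace = H5Screate_simple(1, &global_n, NULL);
    hid_t memspace  = H5Screate_simple(1, &local_n, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT); /* 1.8-style call */

    /* Each rank selects its own contiguous hyperslab in the file... */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &local_n, NULL);

    /* ...and writes it independently (default transfer mode, no communication). */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, local_buf);

    H5Dclose(dset);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Fclose(file);
    H5Pclose(fapl);
}

So there is no communication step before the write at all; each processor simply writes its own contiguous slab of the file.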
Thank You.

Nikhil

On Jul 22, 2010, at 12:22 PM, Mark Howison wrote:

> Hi Nikhil,
>
> I am in the NERSC Analytics group and have done extensive benchmarking
> and testing of I/O on Franklin. We have been working in collaboration
> with the HDF Group for almost a year now to improve parallel I/O
> performance on Lustre file systems, with Franklin as one of our
> primary test machines.
>
> The root-only write scenario you describe will always lead to
> serialization, because you have only one compute node communicating
> with the I/O servers (called "OSTs" in Lustre terminology).
>
> In your parallel scenario (which is called "independent" parallel I/O,
> as opposed to "collective" parallel I/O, which I will describe in a
> bit), you are probably experiencing serialization because you are
> using the default Lustre striping on Franklin, which uses only 2
> stripes. This means that all 200 of your processors are communicating
> with only 2 OSTs out of the 48 available. You can find more about
> Lustre striping on this page:
>
> http://www.nersc.gov/nusers/systems/franklin/io.php
>
> If you increase the stripe count using
>
> stripe_large myOutputDir/ (sets the striping on the directory and any
> new files created in it)
>
> or
>
> stripe_medium specificFile.h5 (this touches the file before your
> program runs, but needs to be done for each output file)
>
> you will use all 48 OSTs and should see improved performance in
> parallel mode. From your plot, it looks like you are getting around
> 500-1100 MB/s write bandwidth out of the ~12 GB/s peak available on
> Franklin.
>
> A further optimization that may help is to enable "collective" mode,
> which creates a one-to-one mapping between a subset of your processors
> and the OSTs, and involves a communication step similar to the one you
> implemented for the root-only scenario. The other processors send
> their data to the subset, and the subset writes the data to disk (this
> is called "two-phase I/O" or "collective buffering"). The additional
> coordination achieved by collective I/O can improve performance for
> many I/O patterns. You can find more details about this in the NERSC
> parallel I/O tutorial:
>
> http://www.nersc.gov/nusers/help/tutorials/io/
>
> including some code snippets for how to set this up in HDF5. It also
> summarizes some of the improvements we have been working on, which
> will soon be rolled into the public release of the HDF5 library.
>
> Let me know if you have more questions, or want to continue this
> discussion offline. I would be glad to talk with you further or to
> help you modify your code or run more I/O tests.
>
> Mark
>
> On Thu, Jul 22, 2010 at 12:42 AM, Nikhil Laghave
> <[email protected]> wrote:
>> Hello All,
>>
>> I have a generic question regarding the comparison of seq. binary and
>> parallel HDF5 for I/O of large files.
>>
>> I am using the Franklin supercomputer at NERSC for my experiments. The
>> datasets/file sizes are between 55 GB and 111 GB. In the seq. binary
>> case they are written by a single processor: several (~200) processors
>> send their data to a single root processor, which does the I/O to disk.
>> So, basically, only one processor is doing the I/O to disk.
>>
>> In the parallel HDF5 case, all of the ~200 processors do the I/O to
>> disk independently, without communication to the root processor.
>>
>> However, on the Lustre file system there are file locks, so all of the
>> ~200 write operations are in fact serialized.
>>
>> Now, when I compare the performance of seq. binary vs. parallel HDF5,
>> the only difference is that in the seq. binary case there is
>> communication overhead, which according to my measurements is not
>> large. Since both writes (seq. binary and parallel HDF5) are
>> sequential/serialized, I expected the performance to be similar.
>> However, in my experiments parallel HDF5 outperforms seq. binary
>> significantly. I do not understand why this is so, since even the
>> parallel HDF5 write operations are serialized. The attached plot
>> illustrates my doubt.
>>
>> Can someone please explain why parallel HDF5 outperforms seq. binary
>> writes even though the parallel HDF5 writes are also serialized? Your
>> inputs are greatly appreciated. Thank you.
>>
>> Nikhil
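P.S. One more question about the collective ("two-phase") mode described in your message above, just to check that I am reading the tutorial correctly: is switching a write from independent to collective mode essentially a matter of the dataset transfer property list, roughly as below? (This is only a sketch using the same placeholder handles as in my sketch above; please correct me if I have the call wrong.)

/* Sketch only: dset, memspace, filespace and local_buf are the placeholder
 * handles from the independent-write sketch above. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* instead of the default independent mode */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local_buf);
H5Pclose(dxpl);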
