Hi Nikhil,

Lustre locks files on a per-stripe basis. The default striping on Franklin is a stripe count of 2 and a stripe size of 4MB, so a file is broken into 4MB regions that alternate between two OSTs. If two or more processors try to write to the same 4MB region, Lustre invokes a lock manager to serialize access, which can be very costly.
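To make the striping arithmetic concrete, here is a small Python sketch. The 4MB/2-stripe numbers are Franklin's defaults from above; the function names are mine, and simple round-robin placement from offset 0 is an assumption:

```python
STRIPE_SIZE = 4 * 1024 * 1024   # Franklin default stripe size (4MB)
STRIPE_COUNT = 2                # Franklin default stripe count

def ost_index(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """OST serving the stripe that contains this byte offset,
    assuming simple round-robin striping starting at offset 0."""
    return (offset // stripe_size) % stripe_count

def same_stripe(a, b, stripe_size=STRIPE_SIZE):
    """True if two offsets fall in the same stripe, i.e. writes to
    them are serialized by the same per-stripe lock."""
    return a // stripe_size == b // stripe_size

# The file alternates 4MB regions between the two OSTs:
assert [ost_index(i * STRIPE_SIZE) for i in range(4)] == [0, 1, 0, 1]

# Two ranks writing inside the same 4MB region contend for one lock...
assert same_stripe(0, 1_000_000)
# ...while stripe-aligned offsets (multiples of the stripe size) do not:
assert not same_stripe(0 * STRIPE_SIZE, 1 * STRIPE_SIZE)
```

This is why aligning each processor's write offset to a multiple of the stripe size removes the contention: no two aligned, disjoint writes of at most one stripe ever share a lock.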
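The "two-phase I/O" (collective buffering) pattern discussed below can also be sketched as a toy model. This is not the Cray MPI-IO implementation, just an illustration of the data movement; the sizes and variable names are invented, with a tiny 16-element array standing in for a real dataset:

```python
# Toy model of two-phase ("collective buffering") I/O: 8 ranks, 2
# aggregators (stand-in for "number of writers = stripe count").
NRANKS = 8
AGGREGATORS = 2
ELEMS_PER_RANK = 2

# Phase 0: each rank owns a contiguous slice of a 1D array.
data = {rank: list(range(rank * ELEMS_PER_RANK, (rank + 1) * ELEMS_PER_RANK))
        for rank in range(NRANKS)}

# Phase 1 (communication): each rank ships its slice to the aggregator
# responsible for the aligned file region containing its offset.
region = (NRANKS * ELEMS_PER_RANK) // AGGREGATORS   # elements per aggregator
buffers = {agg: {} for agg in range(AGGREGATORS)}
for rank, chunk in data.items():
    offset = rank * ELEMS_PER_RANK
    buffers[offset // region][offset] = chunk

# Phase 2 (I/O): each aggregator issues one contiguous, aligned write,
# so the OSTs see only AGGREGATORS disjoint writers instead of NRANKS.
file_image = [None] * (NRANKS * ELEMS_PER_RANK)
for agg, parts in buffers.items():
    for offset in sorted(parts):
        for i, value in enumerate(parts[offset]):
            file_image[offset + i] = value

assert file_image == list(range(NRANKS * ELEMS_PER_RANK))
```

The point of the model: after the gather step, only the aggregators touch the file, each within its own aligned region, which is what makes the pattern look like file-per-processor to the OSTs.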
You can also experience "self-contention" in independent mode: if you have 200 processors writing to 48 OSTs, each OST is servicing several of your processors. Maybe this is what you mean by "write locks on each writer processor"?

Are you writing out the same amount of contiguous data from each processor? If so, you may want to continue to use independent mode in combination with the "chunking" and "alignment" features of HDF5 (described in the NERSC I/O tutorial). These let you guarantee that each processor writes to an offset in the shared file that is a multiple of the stripe size, so that there is no lock contention. You can go a step further and set the stripe size to the size of the contiguous data you are writing. This effectively creates a file-per-processor write pattern: each processor writes to a disjoint region of the shared file and to only one stripe/OST.

If the processors are writing different amounts of data, you probably want to use collective I/O. You are right that the greatest benefits of collective buffering are for non-contiguous data, but the collective buffering algorithm in the Cray MPI-IO library is now Lustre-aware: it will break your I/O pattern up into stripe-aligned writes, and the number of writers will be set to the stripe count. Again, this guarantees no lock contention and sets up a pattern that resembles file-per-processor from the OSTs' point of view.

Mark

On Thu, Jul 22, 2010 at 10:34 AM, Nikhil Laghave <[email protected]> wrote:
> Hi Mark,
>
> Thanks for your reply.
>
> Actually my data layout is contiguous, therefore there are no optimizations
> that can be done using collective IO. Basically I am writing a 1D array to
> disk which is distributed among many processors, with each processor holding
> contiguous data. I will try your striping suggestions to see how much the
> performance improves.
>
> One question I had was regarding locks in the Lustre file system.
> As per my
> knowledge, the Lustre file system puts write locks on each writer processor,
> thus serializing the parallel write operations. Has this changed, or can I
> actually have parallel writes on one file simultaneously? Thank you.
>
> Nikhil
>
> On Jul 22, 2010, at 12:22 PM, Mark Howison wrote:
>
>> Hi Nikhil,
>>
>> I am in the NERSC Analytics group and have done extensive benchmarking
>> and testing of I/O on Franklin. We have been working in collaboration
>> with the HDF Group for almost a year now to improve parallel I/O
>> performance on Lustre file systems, with Franklin as one of our
>> primary test machines.
>>
>> The root-only write scenario you describe will always lead to
>> serialization, because you have only one compute node communicating
>> with the I/O servers (called "OSTs" in Lustre terminology).
>>
>> In your parallel scenario (which is called "independent" parallel I/O,
>> as opposed to "collective" parallel I/O, which I will describe in a
>> bit), you are probably experiencing serialization because you are
>> using the default Lustre striping on Franklin, which uses only 2
>> stripes. This means that all 200 of your processors are communicating
>> with only 2 OSTs, out of the 48 available. You can find more about
>> Lustre striping on this page:
>>
>> http://www.nersc.gov/nusers/systems/franklin/io.php
>>
>> If you increase the stripe count using
>>
>> stripe_large myOutputDir/ (sets the striping on the directory and any
>> new files created in it)
>>
>> or
>>
>> stripe_medium specificFile.h5 (this touches the file before your
>> program runs, but needs to be done for each output file)
>>
>> you will use all 48 OSTs and should see improved performance in
>> parallel mode. From your plot, it looks like you are getting around
>> 500-1100MB/s write bandwidth out of the ~12GB/s peak available on
>> Franklin.
>>
>> A further optimization that may help is to enable "collective" mode,
>> which creates a one-to-one mapping between a subset of your processors
>> and the OSTs, and involves a communication step similar to the one you
>> implemented for the root-only scenario. The other processors send
>> their data to the subset, and the subset writes the data to disk (this
>> is called "two-phase I/O" or "collective buffering"). The additional
>> coordination achieved by collective I/O can improve performance for
>> many I/O patterns. You can find more details about this in the NERSC
>> parallel I/O tutorial:
>>
>> http://www.nersc.gov/nusers/help/tutorials/io/
>>
>> including some code snippets for how to set this up in HDF5. It also
>> summarizes some of the improvements we have been working on, which
>> will soon be rolled into the public release of the HDF5 library.
>>
>> Let me know if you have more questions, or want to continue this
>> discussion offline. I would be glad to talk with you further or to
>> help you modify your code or run more I/O tests.
>>
>> Mark
>>
>> On Thu, Jul 22, 2010 at 12:42 AM, Nikhil Laghave
>> <[email protected]> wrote:
>>> Hello All,
>>>
>>> I have a generic question regarding the comparison of seq. binary and
>>> parallel HDF5 for I/O of large files.
>>>
>>> I am using the Franklin supercomputer at NERSC for my experiments. The
>>> dataset/file sizes are between 55GB and 111GB, which are written by a
>>> single processor in the case of seq. binary. In this case, several (~200)
>>> processors send the data to a single root processor, which does the I/O to
>>> disk. So, basically only 1 processor is doing the I/O to disk.
>>>
>>> In the case of parallel HDF5, all of the ~200 processors do the I/O to disk
>>> independently, without communication to the root processor.
>>>
>>> However, on the Lustre file system, there are file locks, leading to all of
>>> the ~200 write operations being serialized in actuality.
>>>
>>> Now when I compare the performance of seq. binary vs. parallel HDF5, the
>>> only difference is that in the case of seq. binary there is communication
>>> overhead, which according to my measurements is not a big overhead. In that
>>> case, since both writes (seq. binary & parallel HDF5) are
>>> sequential/serialized, I expected the performance to be similar. However,
>>> in my experiments, parallel HDF5 outperforms seq. binary significantly. I
>>> do not understand why this is so, since even the parallel HDF5 write
>>> operations are serialized. The attached plot explains my doubt.
>>>
>>> Please can someone explain to me why parallel HDF5 outperforms seq. binary
>>> writes even though parallel HDF5 writes are also serialized. Your inputs
>>> are greatly appreciated. Thank you.
>>>
>>> Nikhil
>>>
>>> _______________________________________________
>>> Hdf-forum is for HDF software users discussion.
>>> [email protected]
>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
