On Wed, Mar 30, 2011 at 11:50:48AM -0600, Leigh Orf wrote:
> With independent I/O, does each I/O rank open, write, close, and hand it
> off to the next I/O rank, hence only one rank has access to a file at a
> given time (no concurrency)?

No, there is concurrency at the HDF5 level. Sometimes too much
concurrency...

> With collective I/O, are I/O ranks writing concurrently to one file? If
> so, can you control the number of concurrent accesses to a single file?

HDF5 passes the collective request down to the MPI-IO library. It is the
lower MPI-IO library (often, but not always, ROMIO) that selects how many
concurrent readers/writers you can have.

> I have found that with collective I/O, only a small subset of writers is
> actually writing concurrently (much less than the total number of ranks)
> for tens of thousands of cores. What controls this number? Also, how is
> data collected to the I/O ranks? MPI_GATHER? It seems you could run the
> risk of running out of memory if you are collecting large 3D arrays to
> only a few ranks on a distributed-memory machine.

What platform are you on? ROMIO will select one process per compute node
as an "I/O aggregator". So if you somehow have 30k cores on a single
machine, all the I/O goes through one MPI process (by default).

If you want to change that, you can set the hint "cb_config_list". The
full syntax is kind of weird, but you can set it to "*:2" or "*:4" or
however many processes per node you want to aggregate. "cb_nodes" is a
higher-level hint that just says "pick N of these". N is by default the
number of nodes (not processes), but you can select lower or higher and
ROMIO, in consultation with cb_config_list, will pick that many.
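For concreteness, here is a minimal, untested sketch of how those hints
can be handed to parallel HDF5: put them in an MPI_Info object and attach
it to the file access property list with H5Pset_fapl_mpio(), which passes
the info down to MPI_File_open(). The filename and hint values below are
only placeholders; tune them for your own machine.

/* Sketch: passing ROMIO collective-buffering hints through parallel HDF5.
 * The values "*:4" and "64" are examples, not recommendations. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* MPI-IO hints travel in an MPI_Info object. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_config_list", "*:4"); /* 4 aggregators per node */
    MPI_Info_set(info, "cb_nodes", "64");        /* cap total aggregators  */

    /* Attach the info object to the file access property list; HDF5
     * forwards it to the MPI-IO layer when the file is opened. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... collective dataset writes would go here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}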
> I ask these questions because, contrary to what I have been told should
> work, I cannot get even marginally decent performance out of collective
> I/O on Lustre for large numbers of cores (30k cores writing to one file),
> and need to try new approaches. I am hoping that parallel HDF5 can still
> be of use to me rather than having to do my own MPI calls to collect and
> write, or just doing tried-and-true one file per core.

Lustre is kind of a pain in the neck with regard to concurrent I/O. Please
let me know the platform and MPI implementation you are using and I'll
tell you what you need to do to get good performance out of it.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
