On Wed, Mar 30, 2011 at 11:50:48AM -0600, Leigh Orf wrote:
> With independent I/O, does each I/O rank open, write, close, and hand it off
> to the next I/O rank, hence only one rank has access to a file at a given
> time (no concurrency)?

No, there's concurrency at the HDF5 level.  Sometimes too much
concurrency...

> With collective I/O, are I/O ranks writing concurrently to one file? If so,
> can you control the number of concurrent accesses to a single file?

HDF5 passes the collective request down to the MPI-IO library.  It's
the lower MPI-IO library (often but not always ROMIO) that will select
how many concurrent readers/writers you can have.
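
For concreteness, here is a rough sketch of where that choice gets made
on the HDF5 side (my illustration, not part of the original mail; the
dataset, dataspaces, and buffer are assumed to be set up already against
a file opened with an MPI-IO file access property list):

    #include <hdf5.h>

    /* Sketch: choose the MPI-IO transfer mode for one H5Dwrite.
     * H5FD_MPIO_INDEPENDENT lets every rank write on its own;
     * H5FD_MPIO_COLLECTIVE hands the whole request to the MPI-IO
     * layer (often ROMIO), which picks the aggregators. */
    void write_slab(hid_t dset, hid_t memspace, hid_t filespace,
                    const double *buf)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
        H5Pclose(dxpl);
    }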


> I have found that with collective I/O, only a small subset of writers is
> actually writing concurrently (far fewer than the total number of ranks) for
> tens of thousands of cores. What controls this number? Also, how is data
> collected to the I/O ranks? MPI_GATHER? It seems you could run the risk of
> running out of memory if you are collecting large 3D arrays to only a few
> ranks on a distributed-memory machine.

What platform are you on?  ROMIO will select one process per compute
node as an "I/O aggregator".  So if you somehow had 30k cores on a
single compute node, all the I/O would go through one MPI process (by
default).

If you want to change that, you can set the hint "cb_config_list".
The full syntax is kind of weird, but you can set it to "*:2" or "*:4"
or to however many aggregator processes per node you want.

"cb_nodes" is a higher-level hint that just says "pick N of these".  N
is by defualt the number of nodes (not processes), but you can select
lower or higher and ROMIO, in consultation with cb_config_list, will
pick that many.
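
In code it looks roughly like this (a sketch of mine, not from the
original mail; the hint values, the file name, and the assumption that
MPI_Init has already been called are all placeholders):

    #include <mpi.h>
    #include <hdf5.h>

    /* Sketch: pass ROMIO's aggregation hints through the HDF5 file
     * access property list.  "*:4" asks for 4 aggregator processes
     * per node; "cb_nodes" caps the total number of aggregators. */
    hid_t open_parallel_file(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_config_list", "*:4");
        MPI_Info_set(info, "cb_nodes", "128");

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

        hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        H5Pclose(fapl);
        MPI_Info_free(&info);
        return file;
    }

If you can get at the underlying MPI_File handle, MPI_File_get_info will
show you which hint values the MPI-IO layer actually accepted.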

> I ask these questions because, contrary to what I have been told should work,
> I cannot get even marginally decent performance out of collective I/O on
> Lustre for large numbers of cores (30k cores writing to one file), and need
> to try new approaches. I am hoping that parallel HDF5 can still be of use to
> me rather than having to do my own MPI calls to collect and write, or just
> doing the tried-and-true one file per core.

Lustre is kind of a pain in the neck with regard to concurrent I/O.
Please let me know the platform and MPI implementation you are using
and I'll tell you what you need to do to get good performance out of
it.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
