On Tue, Feb 22, 2011 at 5:49 PM, Mark Howison <[email protected]> wrote:
> Hi Leigh,
>
> It is true that you need to align writes to Lustre stripe boundaries
> to get reasonable performance to a single shared file. If you use
> collective I/O, as Rob and Quincey have suggested, it will handle
> this automatically (since mpt/3.2) by aggregating your data on a
> subset of "writer" MPI tasks, then packaging the data into
> stripe-sized writes. It will also try to set the number of writers to
> the number of stripes.
>
> Alternatively, if you are writing the same amount of data from every
> task, you can use an independent I/O approach that combines the HDF5
> chunking and alignment properties to guarantee stripe-sized writes.
> The caveat is that your chunks will be padded with empty data out to
> the stripe size, so this potentially wastes space on disk. In some
> cases, though, we have seen very good performance with independent I/O
> even with up to thousands of tasks, for instance with our GCRM I/O
> benchmark (based on a climate code) on Franklin and Jaguar (both Cray
> XTs). You can read more about that in our "Tuning HDF5 for Lustre"
> paper that you referenced in a previous email. If you go this route,
> you will also want to use two other optimizations we describe in that
> paper: disabling an ftruncate() call at file close that leads to
> catastrophic delays on Lustre, and suspending metadata flushes until
> file close (since the chunk indexing will generate considerable
> metadata activity).

Do I assume correctly that, when using collective I/O, phdf5 will (quoting the "Tuning HDF5 for Lustre" document) both "select the correct stripe count" and "align operations to stripe boundaries"? Will this apply even if I use subcommunicators to write several (or hundreds of) hdf5 files at the same time? I just want to be sure. It seems that collective I/O is the easy way to go if it takes care of the underlying decisions to optimize writing.

However, do any assumptions go into this, or is HDF able to query the lfs parameters? On Kraken, you can set the following parameters: the number of bytes on each OST, the index of the first stripe, and the number of OSTs to stripe across. It seems the only parameter really in question is the number of bytes per OST; the OST index of the first stripe should just be left at the default, and the number of OSTs should be set to the maximum value (160 on Kraken).

What strategy should I use to decide the number of bytes per OST? Should I try to make it roughly the chunk size I am using for 3D data? Or... ? You can set it anywhere from the kB to GB range.

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center
for Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
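For reference, here is a minimal sketch of the collective-I/O path discussed above, assuming HDF5 1.8 and an MPI-IO layer that honors Lustre striping hints (e.g. Cray MPT/ROMIO). The file name, dataset layout, stripe count, and stripe size are illustrative assumptions, not values taken from this thread; on most systems the hints only take effect when the file is first created.

/* Hypothetical sketch: collective write to a single shared HDF5 file,
 * passing Lustre striping hints through MPI-IO. */
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Striping hints are honored only at file creation time. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "160");     /* number of OSTs (assumed) */
    MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MB stripe size (assumed) */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One row of 1M floats per task, written collectively. */
    hsize_t n = 1 << 20;
    hsize_t dims[2] = { (hsize_t)nprocs, n };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[2] = { (hsize_t)rank, 0 };
    hsize_t count[2] = { 1, n };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Collective transfer: MPI-IO aggregates data onto "writer" tasks and
     * issues stripe-sized writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    float *buf = malloc(n * sizeof(float));
    for (hsize_t i = 0; i < n; i++) buf[i] = (float)rank;
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

The same pattern applies when writing many files over subcommunicators: pass the subcommunicator instead of MPI_COMM_WORLD to H5Pset_fapl_mpio, one file per subcommunicator.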
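And a hedged sketch of the independent-I/O route Mark describes, combining chunking with the HDF5 alignment property so that each chunk occupies whole stripes. The 1 MB stripe size, file name, and per-task layout are assumptions; the ftruncate() and metadata-flush optimizations from the paper were library-level changes at the time and are not shown here.

/* Hypothetical sketch: independent I/O with stripe-aligned chunks.
 * Every task writes one equally sized chunk of its own. */
#include <hdf5.h>
#include <mpi.h>

void write_independent(MPI_Comm comm, int rank, int nprocs,
                       const float *buf, hsize_t n)
{
    hsize_t stripe = 1048576;  /* assumed 1 MB stripe size; match lfs settings */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);

    /* Align object allocations to stripe boundaries so chunks start on their
     * own stripe; the padding is the wasted disk space noted above. */
    H5Pset_alignment(fapl, 0, stripe);

    hid_t file = H5Fcreate("independent.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One chunk per task, all the same size, as this approach requires. */
    hsize_t dims[2]  = { (hsize_t)nprocs, n };
    hsize_t chunk[2] = { 1, n };
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);

    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Independent (non-collective) transfer: each task writes its own chunk. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);

    hsize_t start[2] = { (hsize_t)rank, 0 };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, chunk, NULL);
    hid_t memspace = H5Screate_simple(2, chunk, NULL);

    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(dcpl); H5Pclose(fapl); H5Fclose(file);
}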
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
