Some background before I get to the problem: I am attempting the largest simulations I have ever done, so this is uncharted territory for me. I am running on the Kraken TeraGrid resource. The application is a 3D cloud model, and the output consists mostly of 2D and 3D floating point fields.
Each MPI rank runs on one core; I am not using any OpenMP/threads, and that is not an option right now with the way the model is written. The full problem size is 3300x3000x350 and I'm using a 2D parallel decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with each rank owning 22x15x350 points). This geometry is likely what we are 'stuck' with unless we go to a 3D parallel decomposition, and that is not an attractive option.

I have created a few different MPI communicators to handle I/O. The model writes one single HDF5 file full of 2D and 1D floating point data, plus a tiny bit of metadata in the form of integers and attributes (I will call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD communicator, so each of the 30,000 ranks writes to it; for each field, every rank writes a 22x15 'patch' of floating point data. I would prefer not to split this 2D file (which is about 1 GB in size) up, as it's used for a quick look at how the simulation is progressing and can be visualized directly with software I wrote.

With the files containing the 3D floating point arrays (call them the 3D files), I have it set up such that a flexible number of ranks can write to each HDF5 file, so long as the numbers divide evenly into the full problem. For instance, I currently have each 3D HDF5 file written by 15x20 (300) ranks, so a total of 100 3D HDF5 files are written for a history dump, and each file contains 3D arrays of size 330x300x350. Hence these 3D HDF5 files use a different communicator than MPI_COMM_WORLD, which I assemble before any I/O occurs (see the communicator sketch after the questions below).

The 2D and 3D files are written at the same time (within the same routine). For each field, I either write 2D and 3D data, or just 2D data. I can turn off writing the 3D data and just write the 2D data, but not the other way around (I could change this and may do so). I currently have a run in the queue where only 2D data is written, so I can determine whether the bottleneck is with that file as opposed to the 3D files.

The problem I am having is abysmal I/O performance, and I am hoping that maybe I can get some pointers. I fully realize that the Lustre file system on the Kraken TeraGrid machine is not perfect and has its quirks. However, after 10 minutes of writing the 2D file and the 3D files, I had only output about 10 GB of data.

Questions:

1. Should I expect poor performance with 30,000 cores writing tiny 2D patches to one file? I have considered creating another communicator, doing MPI_Gather on it to reassemble the 2D data, and then opening the 2D file on that smaller communicator, so that fewer ranks access the file at once (roughly what the gather sketch below shows). Since I am not familiar with the internals of parallel HDF5, I don't know whether doing that is necessary or recommended.

2. Since I have flexibility with the number of 3D files, should I create fewer? More?

3. There is a command (lfs) on Kraken which controls striping patterns. Could I perhaps see better performance by adjusting the striping? I have looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre Striping and Parallel I/O" but did not come away with any clear message about how I should modify the default settings.

4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try independent (H5FD_MPIO_INDEPENDENT)? The transfer mode is the one knob flipped in the patch-write sketch below.
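For reference, here is a minimal sketch of roughly what each rank's write of a 22x15 patch into the shared 2D file looks like. Names such as write_2d_patch, myi, and myj are illustrative, not the actual model code, and dataset creation/error handling are omitted. The H5Pset_dxpl_mpio call is the line I would flip for question 4:

/* Minimal sketch: one rank writes its 22x15 patch of a 3300x3000 field
 * into the shared 2D file via a hyperslab selection. Assumes the file
 * and dataset were already created collectively elsewhere. */
#include <mpi.h>
#include <hdf5.h>

void write_2d_patch(MPI_Comm comm, const char *fname, const char *dname,
                    const float *patch, hsize_t myi, hsize_t myj)
{
    const hsize_t NX = 3300, NY = 3000;   /* full 2D field        */
    const hsize_t nxp = 22,  nyp = 15;    /* per-rank patch size   */

    /* Open the shared file with the MPI-IO driver on the given communicator */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fopen(fname, H5F_ACC_RDWR, fapl);
    H5Pclose(fapl);

    /* File dataspace and this rank's 22x15 hyperslab within it */
    hsize_t gdims[2] = { NX, NY };
    hsize_t start[2] = { myi * nxp, myj * nyp };
    hsize_t count[2] = { nxp, nyp };
    hid_t filespace = H5Screate_simple(2, gdims, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* In-memory dataspace matching the local patch */
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Transfer property: swap H5FD_MPIO_COLLECTIVE for
     * H5FD_MPIO_INDEPENDENT to compare the two modes (question 4) */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    hid_t dset = H5Dopen(file, dname, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, patch);

    H5Dclose(dset);
    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Fclose(file);
}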
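And here is a sketch of the kind of communicator grouping I mean for the 3D files. The model assembles its own communicators, but the grouping is equivalent to something like this MPI_Comm_split (again, names like make_file_comm, ri, and rj are illustrative):

/* Sketch: carve the 150x200 rank grid into one communicator per 3D file,
 * 15x20 ranks each -> 100 files. Assumes a row-major rank layout; adapt
 * to however the model actually arranges ranks. */
#include <mpi.h>

MPI_Comm make_file_comm(int nrank_x, int nrank_y,      /* 150, 200 */
                        int ranks_per_file_x,          /* 15       */
                        int ranks_per_file_y)          /* 20       */
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Position of this rank in the 150x200 decomposition */
    int ri = world_rank / nrank_y;
    int rj = world_rank % nrank_y;

    /* All ranks in the same 15x20 block share one color, hence one file */
    int color = (ri / ranks_per_file_x) * (nrank_y / ranks_per_file_y)
              + (rj / ranks_per_file_y);

    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &file_comm);
    return file_comm;   /* pass this to H5Pset_fapl_mpio for the 3D file */
}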
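Finally, the gather-and-reassemble idea from question 1, sketched with illustrative names; the reordering of the gathered patches into a contiguous tile is omitted:

/* Sketch: gather the 22x15 patches within a group communicator onto its
 * rank 0, so only those aggregator ranks would touch the 2D file. */
#include <mpi.h>
#include <stdlib.h>

void gather_patches(MPI_Comm group_comm, float *patch, int patch_elems,
                    float **gathered /* out: valid on group rank 0 only */)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    *gathered = NULL;
    if (grank == 0)
        *gathered = malloc((size_t)gsize * patch_elems * sizeof(float));

    /* Rank 0 of each group receives all patches in group-rank order; it
     * would then reorder them into a contiguous tile before writing. */
    MPI_Gather(patch, patch_elems, MPI_FLOAT,
               *gathered, patch_elems, MPI_FLOAT, 0, group_comm);

    /* The aggregators themselves would be grouped into their own
     * communicator, e.g.
     *   MPI_Comm_split(MPI_COMM_WORLD, grank == 0 ? 0 : MPI_UNDEFINED,
     *                  0, &agg_comm);
     * and agg_comm passed to H5Pset_fapl_mpio when opening the 2D file,
     * so far fewer ranks access it at once. */
}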
Since I am unsure where the bottleneck is, I'm asking the HDF5 list first; as I understand it, some of the folks here are familiar with the Kraken resource and have used parallel HDF5 with very large numbers of ranks. Any tips or suggestions for how to wrestle with this problem are greatly appreciated.

Thanks,

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
