On Mon, Feb 21, 2011 at 8:02 AM, Rob Latham <[email protected]> wrote:

> On Thu, Feb 17, 2011 at 01:49:16PM -0700, Leigh Orf wrote:
> > Some background before I get to the problem:
> >
> > I have created a few different MPI communicators to handle I/O. The
> > model writes one single hdf5 file full of 2D and 1D floating point
> > data, as well as a tiny bit of metadata in the form of integers and
> > attributes (I will call this the 2D file). The 2D file is accessed
> > through the MPI_COMM_WORLD communicator - so each of the 30,000 ranks
> > writes to this file. I would prefer not to split this 2D file (which
> > is about 1 GB in size) up, as it's used for a quick look at how the
> > simulation is progressing, and can be visualized directly with
> > software I wrote. For this file, each rank is writing a 22x15 'patch'
> > of floating point data for each field.
>
> One big file, collectively accessed.  Sounds great to me.  What is the
> version of MPT (the Cray MPI library) on kraken?  At this point I would
> be shocked if it's older than 3.2, but since you are using collective
> I/O (yay!), make sure you are using MPT 3.2 or newer. (I think the old
> ones are kept around.)
>

module avail sez, among many other things:

xt-mpt/5.0.0(default)

There are other versions available, up to xt-mpt/5.2.0.


> > With the files containing the 3D floating point arrays (call them the
> > 3D files), I have it set up such that a flexible number of ranks can
> > each write to a HDF5 file, so long as the numbers divide evenly into
> > the full problem. For instance, I currently have it set up such that
> > each 3D HDF5 file is written by 15x20 (300) ranks and therefore a
> > total of 100 3D HDF5 files are written for a history dump. So each
> > file contains 3D arrays of size 330x300x330. Hence, these 3D hdf5
> > files are using a different communicator than MPI_COMM_WORLD that I
> > assemble before any I/O occurs.
>
> This is clever, and does let you tune at the application level, but I
> don't think it's necessary.   Often the MPI-IO hints are better suited
> for such tuning, but once you've upped the striping factor (Mark's
> email), I don't think you'll need those either.
>

Now you tell me :) Well, since I'm preparing for 100,000 cores on the
upcoming Blue Waters machine, having one gazillobyte file containing the
full domain is not an attractive option, for several reasons. I have always
assumed we'd end up writing multiple files per history dump, and I have been
under the impression from the Blue Waters folks that something between 1 and
numcores files is probably going to provide the best performance. So that's
why I went down this path. Another logistical reason is that only a small
portion of the full model domain typically has the interesting bits, and
rather than writing code to extract from one monster file, it's easier to
just pick the files I need. Finally, we can toss out parts of the domain we
don't need (like along the edges), and so on. Our proposed simulations are
going to produce petabytes of data. Incidentally, I did receive an email
from one of the Blue Waters folks who wants to work with me on optimizing
I/O (they have seen my code), so I will happily share anything I learn from
them.
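
For reference, the way I build the per-file communicators is essentially an
MPI_Comm_split of MPI_COMM_WORLD with one color per output file. The real
code groups a 15x20 block of ranks per file and names things differently;
the flattened sketch below (with a made-up file name) just shows the
mechanism.

/* Sketch only: split MPI_COMM_WORLD into one sub-communicator per 3D
 * output file, assuming a simple 1-D rank numbering with 300 ranks
 * (a 15x20 block in the real model) writing each file. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int ranks_per_file = 300;     /* e.g. 15x20 ranks per 3D file */
    int color = rank / ranks_per_file;  /* which 3D file this rank writes */

    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &file_comm);

    /* file_comm is what gets handed to H5Pset_fapl_mpio() when the 3D
     * file for this group is opened; the naming scheme (e.g. "3d_%04d.h5"
     * using color) is made up here. */
    printf("world rank %d belongs to file group %d\n", rank, color);

    MPI_Comm_free(&file_comm);
    MPI_Finalize();
    return 0;
}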


>
> > 1. Should I expect poor performance with 30,000 cores writing tiny 2D
> > patches to one file? I have considered creating another communicator
> > and doing MPI_GATHER on this communicator, reassembling the 2D data,
> > and then opening the 2D file using the communicator - this way fewer
> > ranks would be accessing at once. Since I am not familiar with the
> > internals of parallel HDF5, I don't know if doing that is necessary
> > or recommended.
>
> This workload (30,000 cores writing a tiny patch) is perfect for
> collective I/O.  This gather idea you have is kind of like what will
> happen inside the MPI-IO library, except the MPI-IO library will
> reduce the operation to several writers, not just one.  Also, it's
> been tested and debugged for you.
>

Indeed, after my initial email on this problem, I saw that the 2D files were
written quite quickly on kraken - it only took about 10 seconds to write with
30,000 cores using MPI_COMM_WORLD. That made me happy, so I think I'm good
with the all-to-one 2D file. On to 3D...
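
For the record, the 2D write path boils down to opening the file on
MPI_COMM_WORLD, selecting each rank's 22x15 hyperslab, and writing
collectively. This is a stripped-down sketch rather than the model's actual
code: the file and dataset names are made up, and the patches are laid out
in a fake 1-D row instead of the real 2D decomposition.

/* Sketch: every rank writes its 22x15 patch into one shared dataset
 * using collective parallel HDF5. Assumes MPI is already initialized. */
#include <hdf5.h>
#include <mpi.h>

#define NY 22
#define NX 15

void write_2d_patch(const float *patch, int rank, int nprocs)
{
    /* Open the shared file on MPI_COMM_WORLD */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("quicklook_2d.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    /* One global dataset big enough for every rank's patch */
    hsize_t gdims[2] = { NY, (hsize_t)NX * nprocs };
    hid_t filespace = H5Screate_simple(2, gdims, NULL);
    hid_t dset = H5Dcreate(file, "field", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's 22x15 hyperslab in the file */
    hsize_t start[2] = { 0, (hsize_t)NX * rank };
    hsize_t count[2] = { NY, NX };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Collective transfer, as in the ~10 second runs on kraken */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, patch);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
}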


>
> > 2. Since I have flexibility with the number of 3D files, should I create
> > fewer? More?
>
> The usual parallel I/O advice is to do collective I/O to a single
> shared file.  Lustre does tend to perform better when more files are
> used, but for the sake of your post-processing sanity, let's see what
> happens if we keep a single file for now.
>

I am nervous about doing that, partly because I am trying to make sense of
this:

http://www.nics.tennessee.edu/io-tips

which has some confusing information, at least to me. They claim performance
degradation with many cores both when you write a single shared file (your
suggestion) and when you write 1 file per process (what I used to do, which
is fine for fewer cores).

There is also the issue of somehow mapping your writes to the stripe size,
which is a setting you can control with lfs. Check out the caption to
figure 3, which states:

"Write Performance for serial I/O at various Lustre stripe counts. File size
is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing
more OSTs does not increase write performance. * The Best performance is
seen by utilizing a stripe size which matches the size of write operations.
*"

I'm not sure how to control the size of "write operations", whatever they
are, from the application side - maybe there is a way to set this with HDF5?
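
The closest application-side knobs I've found so far are the MPI-IO hints
you can hand to H5Pset_fapl_mpio() through an MPI_Info object, plus
H5Pset_alignment() to keep HDF5's own allocations on stripe boundaries.
Here's a sketch of what I plan to try; the 1 MiB stripe and 4-OST values
are placeholders, not anything tuned for kraken:

/* Sketch: build a file access property list that pushes Lustre-friendly
 * sizes down through MPI-IO. All values below are placeholders. */
#include <hdf5.h>
#include <mpi.h>

hid_t make_lustre_fapl(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* Hints understood by ROMIO's Lustre driver / Cray MPT */
    MPI_Info_set(info, "striping_unit",   "1048576"); /* match lfs stripe size */
    MPI_Info_set(info, "striping_factor", "4");       /* number of OSTs */
    MPI_Info_set(info, "cb_buffer_size",  "1048576"); /* collective buffer = stripe */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    /* Align any HDF5 object >= 64 KiB on a stripe boundary in the file */
    H5Pset_alignment(fapl, 65536, 1048576);

    MPI_Info_free(&info);
    return fapl;
}

My understanding is that the striping hints only take effect when the file
is first created, so running lfs setstripe on the output directory ahead of
time should accomplish the same thing.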



>
>
> > 3. There is a command (lfs) on kraken which controls striping patterns.
> > Could I perhaps see better performance by mucking with striping? I have
> > looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
> > Striping and Parallel I/O" but did not come back with any clear message
> > about how I should modify the default settings.
> >
> > 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> > independent (H5FD_MPIO_INDEPENDENT)?
>
> I would suggest keeping HDF5 collective I/O enabled at all times.  If
> you find each process is writing on the order of 4 MiB of data, you
> might want to force, at the MPI-IO level, independent I/O.  Here again
> you do so with MPI-IO tuning parameters.  We can go into more detail
> later, if it's even needed.
>
>
Roger that. I kind of figured collective is always better if it's available.
Oddly enough, I *have* to use independent for blueprint (AIX machine) or it
barfs. But at least I can still do phdf5.
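
For anyone following along, the switch between the two modes is a one-liner
on the dataset transfer property list; a minimal sketch (the helper name is
mine):

/* Sketch: toggle between collective and independent transfers; on
 * blueprint (the AIX machine) the independent branch gets taken. */
#include <hdf5.h>

hid_t make_xfer_plist(int use_collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, use_collective ? H5FD_MPIO_COLLECTIVE
                                          : H5FD_MPIO_INDEPENDENT);
    return dxpl;   /* pass this to every H5Dwrite() call */
}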

I have time on kraken and wish to do some very large but very short
"simulations" where I just start the model, run one time step, dump files,
and do timings. My head hurts right now with too many knobs to turn (# of
phdf5 files, lfs options, etc.).
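
When I get to those timing runs, the plan is nothing fancier than
MPI_Wtime() around each dump, reduced to the max across ranks (the helper
below is a sketch, not code that exists yet):

/* Sketch: time a history dump; the slowest rank is what matters. */
#include <mpi.h>
#include <stdio.h>

void time_history_dump(MPI_Comm comm)
{
    MPI_Barrier(comm);                  /* start everyone together */
    double t0 = MPI_Wtime();

    /* ... write the 2D and 3D files here ... */

    double t1 = MPI_Wtime();
    double elapsed = t1 - t0, max_elapsed;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        printf("history dump took %.2f s (max over ranks)\n", max_elapsed);
}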


> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>



-- 
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200