On Thu, Feb 17, 2011 at 01:49:16PM -0700, Leigh Orf wrote:
> Some background before I get to the problem:
> 
> I have created a few different MPI communicators to handle I/O. The model
> writes one single hdf5 file full of 2D and 1D floating point data, as well
> as a tiny bit of metadata in the form of integers and attributes (I will
> call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
> communicator - so each of the 30,000 ranks writes to this file. I would
> prefer not to split this 2D file (which is about 1 GB in size) up, as it's
> used for a quick look at how the simulation is progressing, and can be
> visualized directly with software I wrote. For this file, each rank is
> writing a 22x15 'patch' of floating point data for each field.

One big file, collectively accessed.  Sounds great to me.  What version of
MPT (the Cray MPI library) is on Kraken?  At this point I would be shocked
if it's older than 3.2, but since you are using collective I/O (yay!), make
sure you are on MPT 3.2 or newer.  (I think the older versions are kept
around.)

> With the files containing the 3D floating point arrays (call them the 3D
> files), I have it set up such that a flexible number of ranks can each write
> to an HDF5 file, so long as the numbers divide evenly into the full problem.
> For instance, I currently have it set up such that each 3D HDF5 file is
> written by 15x20 (300) ranks and therefore a total of 100 3D HDF5 files are
> written for a history dump. So each file contains 3D arrays of size
> 330x300x330. Hence, these 3D hdf5 files are using a different communicator
> than MPI_COMM_WORLD that I assemble before any I/O occurs.

This is clever, and it does let you tune at the application level, but I
don't think it's necessary.  Often the MPI-IO hints are better suited for
that kind of tuning, but once you've upped the striping factor (Mark's
email) I don't think you'll need those either.

> 1. Should I expect poor performance with 30,000 cores writing tiny 2D
> patches to one file? I have considered creating another communicator and
> doing MPI_GATHER on this communicator, reassembling the 2D data, and then
> opening the 2D file using the communicator - this way fewer ranks would be
> accessing at once. Since I am not familiar with the internals of
> parallel HDF5, I don't know if doing that is necessary or recommended.

This workload (30,000 cores each writing a tiny patch) is perfect for
collective I/O.  The gather you describe is essentially what already happens
inside the MPI-IO library: its two-phase collective buffering funnels the
operation down to several aggregator writers, not just one.  Also, that code
has been tested and debugged for you.
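For reference, selecting the collective path in HDF5 is just a matter of the
dataset transfer property list.  A quick sketch of the pattern (the dataset,
memory space, and file space handles are placeholders for whatever hyperslab
setup you already have), which it sounds like you are already doing:

    #include <hdf5.h>

    /* Sketch: collective write of one rank's 22x15 patch.  The handles are
     * whatever you already created with your hyperslab selections. */
    void write_patch_collective(hid_t dset, hid_t memspace, hid_t filespace,
                                const float *patch)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* let MPI-IO aggregate */
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, patch);
        H5Pclose(dxpl);
    }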

> 2. Since I have flexibility with the number of 3D files, should I create
> fewer? More?

The usual parallel I/O advice is to do collective I/O to a single shared
file.  Lustre does tend to perform better when more files are used, but for
the sake of your post-processing sanity, let's see what happens if we keep a
single file for now.


> 3. There is a command (lfs) on kraken which controls striping patterns.
> Could I perhaps see better performance by mucking with striping? I have
> looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
> Striping and Parallel I/O" but did not come back with any clear message
> about how I should modify the default settings.
> 
> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> independent (H5FD_MPIO_INDEPENDENT)?

I would suggest keeping HDF5 collective I/O enabled at all times.  If you
find each process is writing on the order of 4 MiB of data, you might want
to force independent I/O at the MPI-IO level.  Here again you would do so
with MPI-IO tuning parameters.  We can go into more detail later, if it's
even needed.
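If it does come to that, the idea would be to leave H5FD_MPIO_COLLECTIVE on
in HDF5 and hand the MPI-IO layer a hint that disables collective buffering
for writes.  A rough sketch; "romio_cb_write" is the ROMIO spelling, and
whether Cray's MPT honors that exact name is worth confirming first:

    #include <mpi.h>
    #include <hdf5.h>

    /* Sketch: keep HDF5's collective transfer mode, but hint the MPI-IO
     * layer to skip collective buffering on writes.  "romio_cb_write" is
     * the ROMIO hint name; confirm the Cray MPT equivalent. */
    hid_t make_no_cb_fapl(MPI_Comm comm)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "disable");

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, info); /* comm = your 3D-file communicator */
        MPI_Info_free(&info);               /* HDF5 keeps its own copy */
        return fapl;
    }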

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
