Hi Leigh,

On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:

> Some background before I get to the problem:
> 
> I have recently been attempting the largest simulations I have ever done, so
> this is uncharted territory for me. I am running on the Kraken TeraGrid
> resource. The
> application is a 3D cloud model, and the output consists mostly of 2D and 3D 
> floating point fields.
> 
> Each MPI rank runs on a core. I am not using any OpenMP/threads. This is not 
> an option right now with the way the model is written.
> 
> The full problem size is 3300x3000x350 and I'm using a 2D parallel 
> decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with 
> each rank having 22x15x350 points). This type of geometry is likely what we 
> are 'stuck' with unless we go with a 3D parallel decomposition, and that is 
> not an attractive option.
> 
> I have created a few different MPI communicators to handle I/O. The model 
> writes a single HDF5 file full of 2D and 1D floating point data, as well as
> a tiny bit of metadata in the form of integers and attributes (I will call 
> this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD 
> communicator - so each of the 30,000 ranks writes to this file. I would 
> prefer not to split up this 2D file (about 1 GB in size), as it's
> used for a quick look at how the simulation is progressing, and can be 
> visualized directly with software I wrote. For this file, each rank is 
> writing a 22x15 'patch' of floating point data for each field.
> 
> With the files containing the 3D floating point arrays (call them the 3D 
> files), I have it set up such that a flexible number of ranks can each write 
> to an HDF5 file, so long as the numbers divide evenly into the full problem.
> For instance, I currently have it set up such that each 3D HDF5 file is 
> written by 15x20 (300) ranks and therefore a total of 100 3D HDF5 files are 
> written for a history dump. So each file contains 3D arrays of size
> 330x300x350. Hence, these 3D HDF5 files are written using a communicator
> other than MPI_COMM_WORLD, which I assemble before any I/O occurs.

        Excellent description, thanks!
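
        (For readers following along, the communicator-per-file arrangement
described above might look roughly like the minimal C sketch below.  The
rank-grid constants, the row-major layout, and all the names are my own
illustrative assumptions, not taken from the actual model.)

#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 150x200 rank grid, 15x20 ranks per file -> 10x10 = 100 files.
     * Assumes ranks are laid out row-major across the 2D decomposition. */
    const int RANKS_X = 150, GROUP_X = 15, GROUP_Y = 20;
    int rx = rank % RANKS_X;
    int ry = rank / RANKS_X;
    int color = (rx / GROUP_X) + (ry / GROUP_Y) * (RANKS_X / GROUP_X);

    /* One sub-communicator per group of 300 ranks, one file per group. */
    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &file_comm);

    char name[64];
    snprintf(name, sizeof(name), "history_3d_%04d.h5", color);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, file_comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create dataspaces, select each rank's hyperslab, write ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Comm_free(&file_comm);
    MPI_Finalize();
    return 0;
}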

> The 2D and 3D files are written at the same time (within the same routine). 
> For each field, I either write 2D and 3D data, or just 2D data. I can turn 
> off writing the 3D data and just write the 2D data, but not the other way 
> around (I could change this and may do so). I currently have a run in the
> queue that writes only 2D data, so I can determine whether the bottleneck
> lies with that file or with the 3D files.
> 
> The problem I am having is abysmal I/O performance, and I am hoping that 
> maybe I can get some pointers. I fully realize that the Lustre file system on
> the Kraken TeraGrid machine is not perfect and has its quirks. However, after
> 10 minutes of writing the 2D file and the 3D files, I had only output about 
> 10 GB of data.

        That's definitely not a good I/O rate. :-/

> Questions:
> 
> 1. Should I expect poor performance with 30,000 cores writing tiny 2D patches 
> to one file? I have considered creating another communicator, doing an
> MPI_GATHER over it to reassemble the 2D data, and then opening the 2D file
> on that smaller communicator, so that fewer ranks would be accessing it at
> once. Since I am not familiar with the internals of parallel HDF5, I don't
> know if doing that is necessary or recommended.

        I don't know if this would help, but I'm definitely interested in 
knowing what happens if you do it.
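
        If you do try it, the gather-then-write idea might look roughly like
the C sketch below.  The group size of 100 ranks per aggregator is an
arbitrary assumption, and you would still need to map the gathered patches
back to their global positions before writing:

#include <mpi.h>
#include <stdlib.h>

#define PATCH (22 * 15)  /* floats per rank, from the 22x15 decomposition */
#define GROUP 100        /* assumed ranks per aggregator: 30,000 -> 300 writers */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Contiguous blocks of GROUP ranks share one aggregator. */
    MPI_Comm gather_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP, rank, &gather_comm);

    int grank, gsize;
    MPI_Comm_rank(gather_comm, &grank);
    MPI_Comm_size(gather_comm, &gsize);

    float patch[PATCH] = {0};  /* this rank's 22x15 field (filled by the model) */
    float *gathered = NULL;
    if (grank == 0)
        gathered = malloc((size_t)gsize * PATCH * sizeof(float));

    MPI_Gather(patch, PATCH, MPI_FLOAT,
               gathered, PATCH, MPI_FLOAT, 0, gather_comm);

    /* Aggregators only: a much smaller communicator for the 2D file,
     * so ~300 ranks open it instead of 30,000. */
    MPI_Comm io_comm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   (grank == 0) ? 0 : MPI_UNDEFINED, rank, &io_comm);
    if (io_comm != MPI_COMM_NULL) {
        /* pass io_comm to H5Pset_fapl_mpio(), reassemble the gathered
         * patches into the global 2D layout, write, close */
        MPI_Comm_free(&io_comm);
    }

    free(gathered);
    MPI_Comm_free(&gather_comm);
    MPI_Finalize();
    return 0;
}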

> 2. Since I have flexibility with the number of 3D files, should I create 
> fewer? More?

        Ditto here.

> 3. There is a command (lfs) on Kraken that controls striping patterns. Could 
> I perhaps see better performance by mucking with striping? I have looked 
> through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre Striping and 
> Parallel I/O" but did not come back with any clear message about how I should 
> modify the default settings.

        Ditto here.
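
        One cheap experiment alongside 'lfs setstripe' on the output
directory: pass striping hints through an MPI_Info object on the file access
property list.  These are the conventional ROMIO hint names, but whether the
MPI stack on Kraken honors them is worth verifying, and the values below are
placeholders rather than recommendations:

#include <mpi.h>
#include <hdf5.h>

/* Sketch: request Lustre striping via ROMIO hints when creating a
 * parallel HDF5 file.  Stripe count and size here are arbitrary. */
hid_t create_striped_file(const char *name, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");    /* # of OSTs (assumed) */
    MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripes (assumed) */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}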

> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try 
> independent (H5FD_MPIO_INDEPENDENT)?

        This should be easy to experiment with, but I don't think it'll help.
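
        It's a one-line change on the dataset transfer property list, so the
experiment is cheap.  A sketch of the switch (the helper name is mine):

#include <hdf5.h>

/* Build a transfer property list for either I/O mode; pass the result
 * as the dxpl argument of H5Dwrite(). */
hid_t make_dxpl(int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);
    return dxpl;
}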

> Since I am unsure where the bottleneck is, I'm asking the HDF5 list first,
> and as I understand it some of the folks here are familiar with the Kraken
> resource and have used parallel HDF5 with very large numbers of ranks. Any
> tips or suggestions for how to wrestle this problem are greatly appreciated.

        I've got some follow-up questions, which might help future 
optimizations:  Are you chunking the datasets, or are they contiguous?  How 
many datasets are you creating each timestep?  How many timesteps are going 
into each file?
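
        (In case it helps when you check: chunking is chosen per-dataset on
the creation property list, and the layout is contiguous by default if
H5Pset_chunk is never called.  A sketch, with purely illustrative chunk
dimensions:)

#include <hdf5.h>

/* Sketch: a chunked dataset creation property list.  The chunk shape
 * below is only an example (roughly one rank's subarray, slowest
 * dimension first); it is not a tuning recommendation. */
hid_t make_chunked_dcpl(void)
{
    hsize_t chunk[3] = {350, 15, 22};
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    return dcpl;
}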

        Quincey
