Hi Leigh,
On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:
> Some background before I get to the problem:
>
> I have recently been attempting the largest simulations I have ever done, so
> this is uncharted territory for me. I am running on the kraken teragrid resource. The
> application is a 3D cloud model, and the output consists mostly of 2D and 3D
> floating point fields.
>
> Each MPI rank runs on a core. I am not using any OpenMP/threads. This is not
> an option right now with the way the model is written.
>
> The full problem size is 3300x3000x350 and I'm using a 2D parallel
> decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with
> each rank having 22x15x350 points). This type of geometry is likely what we
> are 'stuck' with unless we go with a 3D parallel decomposition, and that is
> not an attractive option.
>
> I have created a few different MPI communicators to handle I/O. The model
> writes a single HDF5 file full of 2D and 1D floating point data, as well as
> a tiny bit of metadata in the form of integers and attributes (I will call
> this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
> communicator - so each of the 30,000 ranks writes to this file. I would
> prefer not to split up this 2D file (which is about 1 GB in size), as it's
> used for a quick look at how the simulation is progressing, and can be
> visualized directly with software I wrote. For this file, each rank is
> writing a 22x15 'patch' of floating point data for each field.
>
> With the files containing the 3D floating point arrays (call them the 3D
> files), I have it set up such that a flexible number of ranks can each write
> to an HDF5 file, so long as the numbers divide evenly into the full problem.
> For instance, I currently have it set up such that each 3D HDF5 file is
> written by 15x20 (300) ranks and therefore a total of 100 3D HDF5 files are
> written for a history dump. So each file contains 3D arrays of size
> 330x300x350. Hence, these 3D HDF5 files use a different communicator
> than MPI_COMM_WORLD that I assemble before any I/O occurs.
Excellent description, thanks!
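(For readers following along, the per-file arrangement described above is
essentially an MPI_Comm_split by file index. A rough sketch, where px, py, and
world_rank are placeholder names for values the model already has for each
rank's position in the 150x200 decomposition:)

    /* Rough sketch of a per-file communicator split, assuming a 150x200
     * rank grid with 15x20 ranks sharing each 3D file. */
    int nfiles_x = 150 / 15;                          /* 10 files across x */
    int file_id  = (px / 15) + (py / 20) * nfiles_x;  /* 0 .. 99           */
    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, file_id, world_rank, &file_comm);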
> The 2D and 3D files are written at the same time (within the same routine).
> For each field, I either write 2D and 3D data, or just 2D data. I can turn
> off writing the 3D data and just write the 2D data, but not the other way
> around (I could change this and may do so). I currently have a run in the
> queue where only 2D data is written so I can determine whether the bottleneck
> is with that file as opposed to the 3D files.
>
> The problem I am having is abysmal I/O performance, and I am hoping that
> maybe I can get some pointers. I fully realize that the lustre file system on
> the kraken teragrid machine is not perfect and has its quirks. However, after
> 10 minutes of writing the 2D file and the 3D files, I had only output about
> 10 GB of data.
That's definitely not a good I/O rate. :-/
> Questions:
>
> 1. Should I expect poor performance with 30,000 cores writing tiny 2D patches
> to one file? I have considered creating another communicator and doing
> MPI_GATHER on this communicator, reassembling the 2D data, and then opening
> the 2D file using the communicator - this way fewer ranks would be accessing
> at once. Since I am not familiar with the internals of parallel HDF5, I don't
> know if doing that is necessary or recommended.
I don't know if this would help, but I'm definitely interested in
knowing what happens if you do it.
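If you do try it, the funnel could look roughly like the untested sketch below.
RANKS_PER_WRITER, patch, npatch, and the file name are placeholders, and the
reordering of the gathered patches into the global 2D layout (which depends on
your rank ordering) is not shown:

    #include <stdlib.h>
    #include <mpi.h>
    #include <hdf5.h>

    #define RANKS_PER_WRITER 100   /* e.g. 300 writers instead of 30,000 */

    void write_2d_field(float *patch, int npatch /* 22*15 */)
    {
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Group consecutive ranks; local rank 0 of each group is the writer. */
        MPI_Comm gather_comm;
        MPI_Comm_split(MPI_COMM_WORLD, world_rank / RANKS_PER_WRITER,
                       world_rank, &gather_comm);
        int grank;
        MPI_Comm_rank(gather_comm, &grank);

        float *buf = NULL;
        if (grank == 0)
            buf = malloc((size_t)npatch * RANKS_PER_WRITER * sizeof(float));

        MPI_Gather(patch, npatch, MPI_FLOAT,
                   buf,   npatch, MPI_FLOAT, 0, gather_comm);

        /* Communicator containing only the writer ranks. */
        MPI_Comm io_comm;
        MPI_Comm_split(MPI_COMM_WORLD, grank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &io_comm);

        if (grank == 0) {
            /* Reorder buf into the global 2D layout here (not shown), then
             * open the 2D file on the much smaller io_comm and write one
             * larger hyperslab per writer instead of many tiny ones. */
            hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
            H5Pset_fapl_mpio(fapl, io_comm, MPI_INFO_NULL);
            hid_t file = H5Fopen("quicklook_2d.h5", H5F_ACC_RDWR, fapl);
            /* ... select this writer's hyperslab and H5Dwrite as before ... */
            H5Fclose(file);
            H5Pclose(fapl);
            MPI_Comm_free(&io_comm);
            free(buf);
        }
        MPI_Comm_free(&gather_comm);
    }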
> 2. Since I have flexibility with the number of 3D files, should I create
> fewer? More?
Ditto here.
> 3. There is a command (lfs) on kraken which controls striping patterns. Could
> I perhaps see better performance by mucking with striping? I have looked
> through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre Striping and
> Parallel I/O" but did not come back with any clear message about how I should
> modify the default settings.
Ditto here.
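One thing you can control yourself: on Lustre, new files inherit the stripe
settings of the directory they are created in, so running
lfs setstripe -c <count> on the output directory before the run is the easiest
knob to turn (e.g. a larger stripe count for the big shared 2D file).
Depending on the MPI implementation on kraken, you may also be able to pass
striping hints down through HDF5's MPI-IO file access property list. An
untested sketch (the hint names are ROMIO/Cray conventions, the values and
file name are placeholders, and hints only take effect when the file is first
created):

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");        /* stripe count  */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MiB stripes */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, file_comm, info);   /* file_comm: your I/O communicator */
    hid_t file = H5Fcreate("3dfile_0000.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create datasets and write ... */
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);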
> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> independent (H5FD_MPIO_INDEPENDENT)?
This should be easy to experiment with, but I don't think it'll help.
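In case it's useful for the experiment, the switch is just the transfer
property list handed to H5Dwrite, so it's a one-line change (dset, memspace,
filespace, and patch stand in for whatever handles and buffers you already
have):

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);   /* vs. H5FD_MPIO_COLLECTIVE */
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, patch);
    H5Pclose(dxpl);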
> Since I am unsure where the bottleneck is, I'm asking the HDF5 list first,
> and as I understand it some of the folks here are familiar with the kraken
> resource and have used parallel HDF5 with very large numbers of ranks. Any
> tips or suggestions for how to wrestle this problem are greatly appreciated.
I've got some follow-up questions, which might help with future
optimizations: Are you chunking the datasets, or are they contiguous? How
many datasets are you creating each timestep? How many timesteps are going
into each file?
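(For reference, chunking is set on the dataset creation property list; if you
never do something like the sketch below, your datasets are contiguous. The
dataset name and dimensions here are only placeholders, not a recommendation:)

    hsize_t dims[3]  = {350, 300, 330};   /* one 3D file's array (placeholder order) */
    hsize_t chunk[3] = {350, 15, 22};     /* placeholder chunk shape                 */
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    hid_t dset  = H5Dcreate2(file, "/w", H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);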
Quincey
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org