Quincey,

Answers to your questions below... no top-posting today :)
On Mon, Feb 21, 2011 at 5:38 AM, Quincey Koziol <[email protected]> wrote:

> Hi Leigh,
>
> On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:
>
> Some background before I get to the problem:
>
> I have recently been attempting the largest simulations I have ever done,
> so this is uncharted territory for me. I am running on the kraken teragrid
> resource. The application is a 3D cloud model, and the output consists
> mostly of 2D and 3D floating point fields.
>
> Each MPI rank runs on a core. I am not using any OpenMP/threads. This is
> not an option right now with the way the model is written.
>
> The full problem size is 3300x3000x350 and I'm using a 2D parallel
> decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with
> each rank having 22x15x350 points). This type of geometry is likely what
> we are 'stuck' with unless we go with a 3D parallel decomposition, and
> that is not an attractive option.
>
> I have created a few different MPI communicators to handle I/O. The model
> writes a single HDF5 file full of 2D and 1D floating point data, as well
> as a tiny bit of metadata in the form of integers and attributes (I will
> call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
> communicator, so each of the 30,000 ranks writes to this file. I would
> prefer not to split this 2D file (which is about 1 GB in size) up, as it's
> used for a quick look at how the simulation is progressing, and it can be
> visualized directly with software I wrote. For this file, each rank writes
> a 22x15 'patch' of floating point data for each field.
>
> With the files containing the 3D floating point arrays (call them the 3D
> files), I have it set up such that a flexible number of ranks can each
> write to an HDF5 file, so long as the numbers divide evenly into the full
> problem. For instance, I currently have it set up such that each 3D HDF5
> file is written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5
> files are written for a history dump. So each file contains 3D arrays of
> size 330x300x350. Hence, these 3D HDF5 files use a different communicator
> than MPI_COMM_WORLD, which I assemble before any I/O occurs.
>
> Excellent description, thanks!
>
> The 2D and 3D files are written at the same time (within the same
> routine). For each field, I either write 2D and 3D data, or just 2D data.
> I can turn off writing the 3D data and just write the 2D data, but not the
> other way around (I could change this and may do so). I currently have a
> run in the queue where only 2D data is written, so I can determine whether
> the bottleneck is with that file as opposed to the 3D files.
>
> The problem I am having is abysmal I/O performance, and I am hoping that
> maybe I can get some pointers. I fully realize that the Lustre file system
> on the kraken teragrid machine is not perfect and has its quirks. However,
> after 10 minutes of writing the 2D file and the 3D files, I had only
> output about 10 GB of data.
>
> That's definitely not a good I/O rate. :-/
>
> Questions:
>
> 1. Should I expect poor performance with 30,000 cores writing tiny 2D
> patches to one file? I have considered creating another communicator,
> doing MPI_GATHER on this communicator, reassembling the 2D data, and then
> opening the 2D file using that communicator - this way fewer ranks would
> be accessing it at once. Since I am not familiar with the internals of
> parallel HDF5, I don't know if doing that is necessary or recommended.
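For concreteness, the gather-then-write idea from question 1 would look
roughly like the following. This is a minimal, untested sketch, not code from
the model: row_comm, io_comm, patch2d, row2d and gathered are made-up names,
myi/myj stand for a rank's global grid position, nodex/nodey for the number of
ranks in x and y, and patch2d holds the rank's ni x nj (22x15 here) patch.

    ! Rough sketch (untested): gather each rank's ni x nj patch onto one
    ! aggregator per row, so only nodey ranks open the 2D file instead of
    ! nodex*nodey.  Assumes "use mpi" and "use hdf5", with MPI and the HDF5
    ! Fortran interface already initialized, and patch2d already allocated
    ! (ni,nj) and filled with this rank's 2D field.
    integer :: row_comm, io_comm, rowrank, color, p, ierror
    real, allocatable :: patch2d(:,:)     ! this rank's ni x nj patch
    real, allocatable :: gathered(:,:,:)  ! ni x nj x nodex, blocked by sending rank
    real, allocatable :: row2d(:,:)       ! assembled (ni*nodex) x nj row slab

    ! One communicator per row of ranks (same myj); the rank with myi==0
    ! becomes that row's aggregator.
    call MPI_Comm_split(MPI_COMM_WORLD, myj, myi, row_comm, ierror)
    call MPI_Comm_rank(row_comm, rowrank, ierror)

    ! Gather every patch in this row onto the row root.
    if (rowrank == 0) then
       allocate(gathered(ni, nj, nodex))
    else
       allocate(gathered(1, 1, 1))   ! receive buffer only matters at the root
    end if
    call MPI_Gather(patch2d, ni*nj, MPI_REAL, gathered, ni*nj, MPI_REAL, &
                    0, row_comm, ierror)

    ! Row root reassembles the per-rank blocks into one slab, contiguous in i.
    if (rowrank == 0) then
       allocate(row2d(ni*nodex, nj))
       do p = 1, nodex
          row2d((p-1)*ni+1:p*ni, :) = gathered(:, :, p)
       end do
    end if

    ! Only the row roots join io_comm and open the 2D file; each then writes
    ! an (ni*nodex) x nj hyperslab at offset (0, myj*nj), using the same
    ! property-list and hyperslab calls as in the 3D snippet further down.
    if (rowrank == 0) then
       color = 1
    else
       color = MPI_UNDEFINED   ! these ranks get MPI_COMM_NULL and skip the 2D write
    end if
    call MPI_Comm_split(MPI_COMM_WORLD, color, myj, io_comm, ierror)

The gather adds communication, of course, so whether this actually beats
30,000 ranks writing the file directly is something I would have to time.
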
> I don't know if this would help, but I'm definitely interested in knowing
> what happens if you do it.
>
> 2. Since I have flexibility with the number of 3D files, should I create
> fewer? More?
>
> Ditto here.
>
> 3. There is a command (lfs) on kraken which controls striping patterns.
> Could I perhaps see better performance by mucking with striping? I have
> looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
> Striping and Parallel I/O" but did not come back with any clear message
> about how I should modify the default settings.
>
> Ditto here.
>
> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> independent (H5FD_MPIO_INDEPENDENT)?
>
> This should be easy to experiment with, but I don't think it'll help.
>
> Since I am unsure where the bottleneck is, I'm asking the hdf5 list first,
> and as I understand it some of the folks here are familiar with the kraken
> resource and have used parallel HDF5 with very large numbers of ranks. Any
> tips or suggestions for how to wrestle this problem are greatly
> appreciated.
>
> I've got some followup questions, which might help future optimizations:
>
> Are you chunking the datasets, or are they contiguous?

I am chunking the datasets by the dimensions of what is running on each core
(each MPI rank runs on one core). So, if I have 3D arrays dimensioned
15x15x200 on each core, and 4x4 cores on each MPI communicator, the chunk
dimensions are 15x15x200 and the array dimension written to each HDF5 file is
60x60x200.

A snippet from my code follows. The core dimensions are ni x nj x nk. The
file dimensions are ionumi x ionumj x nk. ionumi = ni * corex, where corex is
the number of cores in the x direction spanning one file; same for y. Since I
am only doing a 2D parallel decomposition, nk spans the full vertical extent.
mygroupi goes from 0 to corex-1, and mygroupj goes from 0 to corey-1.

    dims(1) = ionumi
    dims(2) = ionumj
    dims(3) = nk
    chunkdims(1) = ni
    chunkdims(2) = nj
    chunkdims(3) = nk
    count(1) = 1
    count(2) = 1
    count(3) = 1
    offset(1) = mygroupi * chunkdims(1)
    offset(2) = mygroupj * chunkdims(2)
    offset(3) = 0
    stride(1) = 1
    stride(2) = 1
    stride(3) = 1
    block(1) = chunkdims(1)
    block(2) = chunkdims(2)
    block(3) = chunkdims(3)

    call h5screate_simple_f(rank, dims, filespace_id, ierror)
    call h5screate_simple_f(rank, chunkdims, memspace_id, ierror)
    call h5pcreate_f(H5P_DATASET_CREATE_F, chunk_id, ierror)
    call h5pset_chunk_f(chunk_id, rank, chunkdims, ierror)
    call h5dcreate_f(file_id, trim(varname), H5T_NATIVE_REAL, filespace_id, &
                     dset_id, ierror, chunk_id)
    call h5sclose_f(filespace_id, ierror)
    call h5dget_space_f(dset_id, filespace_id, ierror)
    call h5sselect_hyperslab_f(filespace_id, H5S_SELECT_SET_F, offset, count, &
                               ierror, stride, block)
    call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, ierror)
    call h5pset_dxpl_mpio_f(plist_id, MPIO, ierror)
    call h5dwrite_f(dset_id, H5T_NATIVE_REAL, core3d(1:ni,1:nj,1:nk), dims, ierror, &
                    file_space_id = filespace_id, mem_space_id = memspace_id, &
                    xfer_prp = plist_id)

> How many datasets are you creating each timestep?

This is a selectable option. Here is a typical scenario. In this case, just
for some background, corex=4, corey=4 (16 cores per file) and there are 16
files per full domain write. So each .cm1hdf5 file contains 1/16th of the
full domain. The .2Dcm1hdf5 file contains primarily 2D slices of the full
domain. It is written by *ALL* cores (and performance to this file is good,
even with 30,000 cores writing to it on kraken).
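An aside on question 3 (Lustre striping) before the directory listing below:
I have not tried any of this on kraken, so treat it purely as a sketch. The
usual knob is lfs setstripe on the output directory before the run, e.g.
"lfs setstripe -c 8 output_dir" so that new files created there are spread
over 8 OSTs (the count is a placeholder, not a recommendation), and
"lfs getstripe" on a finished file confirms what striping was actually used.
The same kind of experiment can also be attempted from inside the code by
passing MPI-IO hints on the file-access property list. The hint names below
are standard ROMIO hints, but the values are placeholders, comm_3dfile and
filename are stand-ins for the per-file communicator and output file name,
and whether kraken's MPI-IO layer honors these hints would need verifying:

    ! Hedged sketch (not from the model): pass Lustre/ROMIO hints when a 3D
    ! file is created in parallel.  Assumes "use mpi" and "use hdf5" with
    ! both libraries initialized; all values below are placeholders.
    integer        :: info, ierror
    integer(HID_T) :: fapl_id, file_id

    call MPI_Info_create(info, ierror)
    call MPI_Info_set(info, "striping_factor", "8", ierror)     ! stripe count (placeholder)
    call MPI_Info_set(info, "striping_unit", "4194304", ierror) ! 4 MB stripe size (placeholder)
    call MPI_Info_set(info, "cb_nodes", "8", ierror)            ! collective-buffering aggregators (placeholder)

    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
    call h5pset_fapl_mpio_f(fapl_id, comm_3dfile, info, ierror)
    call h5fcreate_f(trim(filename), H5F_ACC_TRUNC_F, file_id, ierror, &
                     access_prp = fapl_id)
    call MPI_Info_free(info, ierror)
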
bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % ls -l
-rw-r--r-- 1 orf jmd  58393968 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.2Dcm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0001.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0002.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0003.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0004.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0005.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0006.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0007.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0008.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0009.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0010.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0011.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0012.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0013.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0014.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0015.cm1hdf5

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep Dataset
/2d/cpc                  Dataset {176/176, 140/140}
/2d/cph                  Dataset {176/176, 140/140}
/2d/cref                 Dataset {176/176, 140/140}
/2d/maxsgs               Dataset {176/176, 140/140}
/2d/maxshs               Dataset {176/176, 140/140}
/2d/maxsrs               Dataset {176/176, 140/140}
/2d/maxsus               Dataset {176/176, 140/140}
/2d/maxsvs               Dataset {176/176, 140/140}
/2d/maxsws               Dataset {176/176, 140/140}
/2d/minsps               Dataset {176/176, 140/140}
/2d/sfcrain              Dataset {176/176, 140/140}
/2d/uh                   Dataset {176/176, 140/140}
/3d/dbz                  Dataset {96/96, 176/176, 140/140}
/3d/khh                  Dataset {96/96, 176/176, 140/140}
/3d/khv                  Dataset {96/96, 176/176, 140/140}
/3d/kmh                  Dataset {96/96, 176/176, 140/140}
/3d/kmv                  Dataset {96/96, 176/176, 140/140}
/3d/ncg                  Dataset {96/96, 176/176, 140/140}
/3d/nci                  Dataset {96/96, 176/176, 140/140}
/3d/ncr                  Dataset {96/96, 176/176, 140/140}
/3d/ncs                  Dataset {96/96, 176/176, 140/140}
/3d/p                    Dataset {96/96, 176/176, 140/140}
/3d/pi                   Dataset {96/96, 176/176, 140/140}
/3d/pipert               Dataset {96/96, 176/176, 140/140}
/3d/ppert                Dataset {96/96, 176/176, 140/140}
/3d/qc                   Dataset {96/96, 176/176, 140/140}
/3d/qg                   Dataset {96/96, 176/176, 140/140}
/3d/qi                   Dataset {96/96, 176/176, 140/140}
/3d/qr                   Dataset {96/96, 176/176, 140/140}
/3d/qs                   Dataset {96/96, 176/176, 140/140}
/3d/qv                   Dataset {96/96, 176/176, 140/140}
/3d/qvpert               Dataset {96/96, 176/176, 140/140}
/3d/rho                  Dataset {96/96, 176/176, 140/140}
/3d/rhopert              Dataset {96/96, 176/176, 140/140}
/3d/th                   Dataset {96/96, 176/176, 140/140}
/3d/thpert               Dataset {96/96, 176/176, 140/140}
/3d/tke                  Dataset {96/96, 176/176, 140/140}
/3d/u                    Dataset {96/96, 176/176, 140/140}
/3d/u_yzlast             Dataset {96/96, 176/176}
/3d/uinterp              Dataset {96/96, 176/176, 140/140}
/3d/upert                Dataset {96/96, 176/176, 140/140}
/3d/upert_yzlast         Dataset {96/96, 176/176}
/3d/v                    Dataset {96/96, 176/176, 140/140}
/3d/v_xzlast             Dataset {96/96, 140/140}
/3d/vinterp              Dataset {96/96, 176/176, 140/140}
/3d/vpert                Dataset {96/96, 176/176, 140/140}
/3d/vpert_xzlast         Dataset {96/96, 140/140}
/3d/w                    Dataset {97/97, 176/176, 140/140}
/3d/winterp              Dataset {96/96, 176/176, 140/140}
/3d/xvort                Dataset {96/96, 176/176, 140/140}
/3d/yvort                Dataset {96/96, 176/176, 140/140}
/3d/zvort                Dataset {96/96, 176/176, 140/140}
/basestate/pi0           Dataset {96/96}
/basestate/pres0         Dataset {96/96}
/basestate/qv0           Dataset {96/96}
/basestate/rh0           Dataset {96/96}
/basestate/th0           Dataset {96/96}
/basestate/u0            Dataset {96/96}
/basestate/v0            Dataset {96/96}
/grid/myi                Dataset {1/1}
/grid/myj                Dataset {1/1}
/grid/ni                 Dataset {1/1}
/grid/nj                 Dataset {1/1}
/grid/nodex              Dataset {1/1}
/grid/nodey              Dataset {1/1}
/grid/nx                 Dataset {1/1}
/grid/ny                 Dataset {1/1}
/grid/nz                 Dataset {1/1}
/grid/x0                 Dataset {1/1}
/grid/x1                 Dataset {1/1}
/grid/y0                 Dataset {1/1}
/grid/y1                 Dataset {1/1}
/mesh/dx                 Dataset {1/1}
/mesh/dy                 Dataset {1/1}
/mesh/xf                 Dataset {140/140}
/mesh/xh                 Dataset {140/140}
/mesh/yf                 Dataset {176/176}
/mesh/yh                 Dataset {176/176}
/mesh/zf                 Dataset {97/97}
/mesh/zh                 Dataset {96/96}
/time                    Dataset {1/1}

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep 3d | grep -v zlast | wc -l
44

So there are 44 3D fields in this case. That's pretty much the kitchen sink;
normally I'd probably be writing half as many datasets. Notice also that I've
got a bunch of tiny bits which serve as metadata (for stitching things back
together for analysis), some small 1D arrays, some 2D arrays, and then the
big 3D arrays. Except for the *zlast arrays, all of the stuff in /3d is
three-dimensional, as you can see. The *zlast stuff is there because some of
the variables have an extra point in the x or y direction (GRR, staggered
grids), and I just write the last planes out in a separate dataset. This is
because I am splitting the writes up into separate HDF5 files; were I writing
only one file, it would be easier. As far as the dimensions of the arrays you
see here, don't take them too seriously - this was from a run on another
machine. I am holding off on kraken until I can get at least a decent idea of
what to try to improve I/O.

> How many timesteps are going into each file?

Only one.

> Quincey
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
