Quincey,

Answers to your questions below... no top-posting today :)
On Mon, Feb 21, 2011 at 5:38 AM, Quincey Koziol <[email protected]> wrote:

> Hi Leigh,
>
> On Feb 17, 2011, at 2:49 PM, Leigh Orf wrote:
>
> Some background before I get to the problem:
>
> I have recently been attempting the largest simulations I have ever done,
> so this is uncharted territory for me. I am running on the kraken teragrid
> resource. The application is a 3D cloud model, and the output consists
> mostly of 2D and 3D floating point fields.
>
> Each MPI rank runs on a core. I am not using any OpenMP/threads. This is
> not an option right now with the way the model is written.
>
> The full problem size is 3300x3000x350 and I'm using a 2D parallel
> decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with
> each rank having 22x15x350 points). This type of geometry is likely what
> we are 'stuck' with unless we go with a 3D parallel decomposition, and
> that is not an attractive option.
>
> I have created a few different MPI communicators to handle I/O. The model
> writes a single HDF5 file full of 2D and 1D floating point data, as well
> as a tiny bit of metadata in the form of integers and attributes (I will
> call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD
> communicator, so each of the 30,000 ranks writes to this file. I would
> prefer not to split this 2D file (which is about 1 GB in size) up, as it's
> used for a quick look at how the simulation is progressing, and it can be
> visualized directly with software I wrote. For this file, each rank writes
> a 22x15 'patch' of floating point data for each field.
>
> With the files containing the 3D floating point arrays (call them the 3D
> files), I have it set up such that a flexible number of ranks can each
> write to an HDF5 file, so long as the numbers divide evenly into the full
> problem. For instance, I currently have it set up such that each 3D HDF5
> file is written by 15x20 (300) ranks, and therefore a total of 100 3D HDF5
> files are written for a history dump. So each file contains 3D arrays of
> size 330x300x350. Hence, these 3D HDF5 files use a different communicator
> than MPI_COMM_WORLD, which I assemble before any I/O occurs.
>
> Excellent description, thanks!
>
> The 2D and 3D files are written at the same time (within the same
> routine). For each field, I either write 2D and 3D data, or just 2D data.
> I can turn off writing the 3D data and just write the 2D data, but not the
> other way around (I could change this and may do so). I currently have a
> run in the queue where only 2D data is written, so I can determine whether
> the bottleneck is with that file as opposed to the 3D files.
>
> The problem I am having is abysmal I/O performance, and I am hoping that
> maybe I can get some pointers. I fully realize that the Lustre file system
> on the kraken teragrid machine is not perfect and has its quirks. However,
> after 10 minutes of writing the 2D file and the 3D files, I had only
> output about 10 GB of data.
>
> That's definitely not a good I/O rate. :-/
>
> Questions:
>
> 1. Should I expect poor performance with 30,000 cores writing tiny 2D
> patches to one file? I have considered creating another communicator,
> doing MPI_GATHER on this communicator, reassembling the 2D data, and then
> opening the 2D file using that communicator - this way fewer ranks would
> be accessing it at once. Since I am not familiar with the internals of
> parallel HDF5, I don't know if doing that is necessary or recommended.
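For concreteness, the gather-then-write idea from question 1 would look
roughly like the following. This is a minimal, untested sketch, not code from
the model: row_comm, io_comm, patch2d, row2d and gathered are made-up names,
myi/myj stand for a rank's global grid position, nodex/nodey for the number of
ranks in x and y, and patch2d holds the rank's ni x nj (22x15 here) patch.

    ! Rough sketch (untested): gather each rank's ni x nj patch onto one
    ! aggregator per row, so only nodey ranks open the 2D file instead of
    ! nodex*nodey.  Assumes "use mpi" and "use hdf5", with MPI and the HDF5
    ! Fortran interface already initialized, and patch2d already allocated
    ! (ni,nj) and filled with this rank's 2D field.
    integer :: row_comm, io_comm, rowrank, color, p, ierror
    real, allocatable :: patch2d(:,:)     ! this rank's ni x nj patch
    real, allocatable :: gathered(:,:,:)  ! ni x nj x nodex, blocked by sending rank
    real, allocatable :: row2d(:,:)       ! assembled (ni*nodex) x nj row slab

    ! One communicator per row of ranks (same myj); the rank with myi==0
    ! becomes that row's aggregator.
    call MPI_Comm_split(MPI_COMM_WORLD, myj, myi, row_comm, ierror)
    call MPI_Comm_rank(row_comm, rowrank, ierror)

    ! Gather every patch in this row onto the row root.
    if (rowrank == 0) then
       allocate(gathered(ni, nj, nodex))
    else
       allocate(gathered(1, 1, 1))   ! receive buffer only matters at the root
    end if
    call MPI_Gather(patch2d, ni*nj, MPI_REAL, gathered, ni*nj, MPI_REAL, &
                    0, row_comm, ierror)

    ! Row root reassembles the per-rank blocks into one slab, contiguous in i.
    if (rowrank == 0) then
       allocate(row2d(ni*nodex, nj))
       do p = 1, nodex
          row2d((p-1)*ni+1:p*ni, :) = gathered(:, :, p)
       end do
    end if

    ! Only the row roots join io_comm and open the 2D file; each then writes
    ! an (ni*nodex) x nj hyperslab at offset (0, myj*nj), using the same
    ! property-list and hyperslab calls as in the 3D snippet further down.
    if (rowrank == 0) then
       color = 1
    else
       color = MPI_UNDEFINED   ! these ranks get MPI_COMM_NULL and skip the 2D write
    end if
    call MPI_Comm_split(MPI_COMM_WORLD, color, myj, io_comm, ierror)

The gather adds communication, of course, so whether this actually beats
30,000 ranks writing the file directly is something I would have to time.
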
> I don't know if this would help, but I'm definitely interested in knowing
> what happens if you do it.
>
> 2. Since I have flexibility with the number of 3D files, should I create
> fewer? More?
>
> Ditto here.
>
> 3. There is a command (lfs) on kraken which controls striping patterns.
> Could I perhaps see better performance by mucking with striping? I have
> looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre
> Striping and Parallel I/O" but did not come back with any clear message
> about how I should modify the default settings.
>
> Ditto here.
>
> 4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try
> independent (H5FD_MPIO_INDEPENDENT)?
>
> This should be easy to experiment with, but I don't think it'll help.
>
> Since I am unsure where the bottleneck is, I'm asking the hdf5 list first,
> and as I understand it some of the folks here are familiar with the kraken
> resource and have used parallel HDF5 with very large numbers of ranks. Any
> tips or suggestions for how to wrestle this problem are greatly
> appreciated.
>
> I've got some followup questions, which might help future optimizations:
>
> Are you chunking the datasets, or are they contiguous?

I am chunking the datasets by the dimensions of what is running on each core
(each MPI rank runs on one core). So, if I have 3D arrays dimensioned
15x15x200 on each core, and 4x4 cores on each MPI communicator, the chunk
dimensions are 15x15x200 and the array dimension written to each HDF5 file is
60x60x200.

A snippet from my code follows. The core dimensions are ni x nj x nk. The
file dimensions are ionumi x ionumj x nk. ionumi = ni * corex, where corex is
the number of cores in the x direction spanning one file; same for y. Since I
am only doing a 2D parallel decomposition, nk spans the full vertical extent.
mygroupi goes from 0 to corex-1, and mygroupj goes from 0 to corey-1.

    dims(1) = ionumi
    dims(2) = ionumj
    dims(3) = nk
    chunkdims(1) = ni
    chunkdims(2) = nj
    chunkdims(3) = nk
    count(1) = 1
    count(2) = 1
    count(3) = 1
    offset(1) = mygroupi * chunkdims(1)
    offset(2) = mygroupj * chunkdims(2)
    offset(3) = 0
    stride(1) = 1
    stride(2) = 1
    stride(3) = 1
    block(1) = chunkdims(1)
    block(2) = chunkdims(2)
    block(3) = chunkdims(3)

    call h5screate_simple_f(rank, dims, filespace_id, ierror)
    call h5screate_simple_f(rank, chunkdims, memspace_id, ierror)
    call h5pcreate_f(H5P_DATASET_CREATE_F, chunk_id, ierror)
    call h5pset_chunk_f(chunk_id, rank, chunkdims, ierror)
    call h5dcreate_f(file_id, trim(varname), H5T_NATIVE_REAL, filespace_id, &
                     dset_id, ierror, chunk_id)
    call h5sclose_f(filespace_id, ierror)
    call h5dget_space_f(dset_id, filespace_id, ierror)
    call h5sselect_hyperslab_f(filespace_id, H5S_SELECT_SET_F, offset, count, &
                               ierror, stride, block)
    call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, ierror)
    call h5pset_dxpl_mpio_f(plist_id, MPIO, ierror)
    call h5dwrite_f(dset_id, H5T_NATIVE_REAL, core3d(1:ni,1:nj,1:nk), dims, ierror, &
                    file_space_id = filespace_id, mem_space_id = memspace_id, &
                    xfer_prp = plist_id)

> How many datasets are you creating each timestep?

This is a selectable option. Here is a typical scenario. In this case, just
for some background, corex=4, corey=4 (16 cores per file) and there are 16
files per full domain write. So each .cm1hdf5 file contains 1/16th of the
full domain. The .2Dcm1hdf5 file contains primarily 2D slices of the full
domain. It is written by *ALL* cores (and performance to this file is good,
even with 30,000 cores writing to it on kraken).
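An aside on question 3 (Lustre striping) before the directory listing below:
I have not tried any of this on kraken, so treat it purely as a sketch. The
usual knob is lfs setstripe on the output directory before the run, e.g.
"lfs setstripe -c 8 output_dir" so that new files created there are spread
over 8 OSTs (the count is a placeholder, not a recommendation), and
"lfs getstripe" on a finished file confirms what striping was actually used.
The same kind of experiment can also be attempted from inside the code by
passing MPI-IO hints on the file-access property list. The hint names below
are standard ROMIO hints, but the values are placeholders, comm_3dfile and
filename are stand-ins for the per-file communicator and output file name,
and whether kraken's MPI-IO layer honors these hints would need verifying:

    ! Hedged sketch (not from the model): pass Lustre/ROMIO hints when a 3D
    ! file is created in parallel.  Assumes "use mpi" and "use hdf5" with
    ! both libraries initialized; all values below are placeholders.
    integer        :: info, ierror
    integer(HID_T) :: fapl_id, file_id

    call MPI_Info_create(info, ierror)
    call MPI_Info_set(info, "striping_factor", "8", ierror)     ! stripe count (placeholder)
    call MPI_Info_set(info, "striping_unit", "4194304", ierror) ! 4 MB stripe size (placeholder)
    call MPI_Info_set(info, "cb_nodes", "8", ierror)            ! collective-buffering aggregators (placeholder)

    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
    call h5pset_fapl_mpio_f(fapl_id, comm_3dfile, info, ierror)
    call h5fcreate_f(trim(filename), H5F_ACC_TRUNC_F, file_id, ierror, &
                     access_prp = fapl_id)
    call MPI_Info_free(info, ierror)
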
bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % ls -l
-rw-r--r-- 1 orf jmd  58393968 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.2Dcm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0000.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0001.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0002.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0003.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0004.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0005.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0006.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0007.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0008.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0009.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0010.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0011.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0012.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0013.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0014.cm1hdf5
-rw-r--r-- 1 orf jmd 342313312 Feb 22 17:57 L500ang120_0.010_1000.0m.03600_0015.cm1hdf5

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep Dataset
/2d/cpc                  Dataset {176/176, 140/140}
/2d/cph                  Dataset {176/176, 140/140}
/2d/cref                 Dataset {176/176, 140/140}
/2d/maxsgs               Dataset {176/176, 140/140}
/2d/maxshs               Dataset {176/176, 140/140}
/2d/maxsrs               Dataset {176/176, 140/140}
/2d/maxsus               Dataset {176/176, 140/140}
/2d/maxsvs               Dataset {176/176, 140/140}
/2d/maxsws               Dataset {176/176, 140/140}
/2d/minsps               Dataset {176/176, 140/140}
/2d/sfcrain              Dataset {176/176, 140/140}
/2d/uh                   Dataset {176/176, 140/140}
/3d/dbz                  Dataset {96/96, 176/176, 140/140}
/3d/khh                  Dataset {96/96, 176/176, 140/140}
/3d/khv                  Dataset {96/96, 176/176, 140/140}
/3d/kmh                  Dataset {96/96, 176/176, 140/140}
/3d/kmv                  Dataset {96/96, 176/176, 140/140}
/3d/ncg                  Dataset {96/96, 176/176, 140/140}
/3d/nci                  Dataset {96/96, 176/176, 140/140}
/3d/ncr                  Dataset {96/96, 176/176, 140/140}
/3d/ncs                  Dataset {96/96, 176/176, 140/140}
/3d/p                    Dataset {96/96, 176/176, 140/140}
/3d/pi                   Dataset {96/96, 176/176, 140/140}
/3d/pipert               Dataset {96/96, 176/176, 140/140}
/3d/ppert                Dataset {96/96, 176/176, 140/140}
/3d/qc                   Dataset {96/96, 176/176, 140/140}
/3d/qg                   Dataset {96/96, 176/176, 140/140}
/3d/qi                   Dataset {96/96, 176/176, 140/140}
/3d/qr                   Dataset {96/96, 176/176, 140/140}
/3d/qs                   Dataset {96/96, 176/176, 140/140}
/3d/qv                   Dataset {96/96, 176/176, 140/140}
/3d/qvpert               Dataset {96/96, 176/176, 140/140}
/3d/rho                  Dataset {96/96, 176/176, 140/140}
/3d/rhopert              Dataset {96/96, 176/176, 140/140}
/3d/th                   Dataset {96/96, 176/176, 140/140}
/3d/thpert               Dataset {96/96, 176/176, 140/140}
/3d/tke                  Dataset {96/96, 176/176, 140/140}
/3d/u                    Dataset {96/96, 176/176, 140/140}
/3d/u_yzlast             Dataset {96/96, 176/176}
/3d/uinterp              Dataset {96/96, 176/176, 140/140}
/3d/upert                Dataset {96/96, 176/176, 140/140}
/3d/upert_yzlast         Dataset {96/96, 176/176}
/3d/v                    Dataset {96/96, 176/176, 140/140}
/3d/v_xzlast             Dataset {96/96, 140/140}
/3d/vinterp              Dataset {96/96, 176/176, 140/140}
/3d/vpert                Dataset {96/96, 176/176, 140/140}
/3d/vpert_xzlast         Dataset {96/96, 140/140}
/3d/w                    Dataset {97/97, 176/176, 140/140}
/3d/winterp              Dataset {96/96, 176/176, 140/140}
/3d/xvort                Dataset {96/96, 176/176, 140/140}
/3d/yvort                Dataset {96/96, 176/176, 140/140}
/3d/zvort                Dataset {96/96, 176/176, 140/140}
/basestate/pi0           Dataset {96/96}
/basestate/pres0         Dataset {96/96}
/basestate/qv0           Dataset {96/96}
/basestate/rh0           Dataset {96/96}
/basestate/th0           Dataset {96/96}
/basestate/u0            Dataset {96/96}
/basestate/v0            Dataset {96/96}
/grid/myi                Dataset {1/1}
/grid/myj                Dataset {1/1}
/grid/ni                 Dataset {1/1}
/grid/nj                 Dataset {1/1}
/grid/nodex              Dataset {1/1}
/grid/nodey              Dataset {1/1}
/grid/nx                 Dataset {1/1}
/grid/ny                 Dataset {1/1}
/grid/nz                 Dataset {1/1}
/grid/x0                 Dataset {1/1}
/grid/x1                 Dataset {1/1}
/grid/y0                 Dataset {1/1}
/grid/y1                 Dataset {1/1}
/mesh/dx                 Dataset {1/1}
/mesh/dy                 Dataset {1/1}
/mesh/xf                 Dataset {140/140}
/mesh/xh                 Dataset {140/140}
/mesh/yf                 Dataset {176/176}
/mesh/yh                 Dataset {176/176}
/mesh/zf                 Dataset {97/97}
/mesh/zh                 Dataset {96/96}
/time                    Dataset {1/1}

bp-login1: /scr/orf/Lnew/L500ang120_0.010_1000.0m.00000.cdir % h5ls -rv L500ang120_0.010_1000.0m.03600_0009.cm1hdf5 | grep 3d | grep -v zlast | wc -l
44

So there are 44 3D fields in this case. That's pretty much the kitchen sink;
normally I'd probably be writing half as many datasets. Notice also that I've
got a bunch of tiny bits which serve as metadata (for stitching things back
together for analysis), some small 1D arrays, some 2D arrays, and then the
big 3D arrays. Except for the *zlast arrays, all of the stuff in /3d is
three-dimensional, as you can see. The *zlast stuff is there because some of
the variables have an extra point in the x or y direction (GRR, staggered
grids), and I just write the last planes out in a separate dataset. This is
because I am splitting the writes up into separate HDF5 files; were I writing
only one file, it would be easier. As far as the dimensions of the arrays you
see here, don't take them too seriously - this was from a run on another
machine. I am holding off on kraken until I can get at least a decent idea of
what to try to improve I/O.

> How many timesteps are going into each file?

Only one.

> Quincey
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
