Some background before I get to the problem: I am attempting the largest simulations I have ever done, so this is uncharted territory for me. I am running on the Kraken TeraGrid resource. The application is a 3D cloud model, and the output consists mostly of 2D and 3D floating point fields.
Each MPI rank runs on one core; I am not using any OpenMP/threads, and that is not an option right now with the way the model is written. The full problem size is 3300x3000x350 and I'm using a 2D parallel decomposition, dividing the problem into 30,000 ranks (150x200 ranks, with each rank owning 22x15x350 points). This geometry is likely what we are 'stuck' with unless we go to a 3D parallel decomposition, and that is not an attractive option.

I have created a few different MPI communicators to handle I/O. The model writes one single HDF5 file full of 2D and 1D floating point data, plus a tiny bit of metadata in the form of integers and attributes (I will call this the 2D file). The 2D file is accessed through the MPI_COMM_WORLD communicator, so each of the 30,000 ranks writes to it; for each field, every rank writes a 22x15 'patch' of floating point data. I would prefer not to split this 2D file (which is about 1 GB in size) up, as it's used for a quick look at how the simulation is progressing and can be visualized directly with software I wrote.

With the files containing the 3D floating point arrays (call them the 3D files), I have it set up such that a flexible number of ranks can write to each HDF5 file, so long as the numbers divide evenly into the full problem. For instance, I currently have each 3D HDF5 file written by 15x20 (300) ranks, so a total of 100 3D HDF5 files are written for a history dump, and each file contains 3D arrays of size 330x300x350. Hence these 3D HDF5 files use a different communicator than MPI_COMM_WORLD, which I assemble before any I/O occurs (see the communicator sketch after the questions below).

The 2D and 3D files are written at the same time (within the same routine). For each field, I either write 2D and 3D data, or just 2D data. I can turn off writing the 3D data and just write the 2D data, but not the other way around (I could change this and may do so). I currently have a run in the queue where only 2D data is written, so I can determine whether the bottleneck is with that file as opposed to the 3D files.

The problem I am having is abysmal I/O performance, and I am hoping that maybe I can get some pointers. I fully realize that the Lustre file system on the Kraken TeraGrid machine is not perfect and has its quirks. However, after 10 minutes of writing the 2D file and the 3D files, I had only output about 10 GB of data.

Questions:

1. Should I expect poor performance with 30,000 cores writing tiny 2D patches to one file? I have considered creating another communicator, doing MPI_Gather on it to reassemble the 2D data, and then opening the 2D file on that smaller communicator, so that fewer ranks access the file at once (roughly what the gather sketch below shows). Since I am not familiar with the internals of parallel HDF5, I don't know whether doing that is necessary or recommended.

2. Since I have flexibility with the number of 3D files, should I create fewer? More?

3. There is a command (lfs) on Kraken which controls striping patterns. Could I perhaps see better performance by adjusting the striping? I have looked through http://www.nics.tennessee.edu/io-tips "I/O Tips - Lustre Striping and Parallel I/O" but did not come away with any clear message about how I should modify the default settings.

4. I am doing collective writes (H5FD_MPIO_COLLECTIVE). Should I try independent (H5FD_MPIO_INDEPENDENT)? The transfer mode is the one knob flipped in the patch-write sketch below.
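For reference, here is a minimal sketch of roughly what each rank's write of a 22x15 patch into the shared 2D file looks like. Names such as write_2d_patch, myi, and myj are illustrative, not the actual model code, and dataset creation/error handling are omitted. The H5Pset_dxpl_mpio call is the line I would flip for question 4:

/* Minimal sketch: one rank writes its 22x15 patch of a 3300x3000 field
 * into the shared 2D file via a hyperslab selection. Assumes the file
 * and dataset were already created collectively elsewhere. */
#include <mpi.h>
#include <hdf5.h>

void write_2d_patch(MPI_Comm comm, const char *fname, const char *dname,
                    const float *patch, hsize_t myi, hsize_t myj)
{
    const hsize_t NX = 3300, NY = 3000;   /* full 2D field        */
    const hsize_t nxp = 22,  nyp = 15;    /* per-rank patch size   */

    /* Open the shared file with the MPI-IO driver on the given communicator */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fopen(fname, H5F_ACC_RDWR, fapl);
    H5Pclose(fapl);

    /* File dataspace and this rank's 22x15 hyperslab within it */
    hsize_t gdims[2] = { NX, NY };
    hsize_t start[2] = { myi * nxp, myj * nyp };
    hsize_t count[2] = { nxp, nyp };
    hid_t filespace = H5Screate_simple(2, gdims, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* In-memory dataspace matching the local patch */
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Transfer property: swap H5FD_MPIO_COLLECTIVE for
     * H5FD_MPIO_INDEPENDENT to compare the two modes (question 4) */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    hid_t dset = H5Dopen(file, dname, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, patch);

    H5Dclose(dset);
    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Fclose(file);
}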
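And here is a sketch of the kind of communicator grouping I mean for the 3D files. The model assembles its own communicators, but the grouping is equivalent to something like this MPI_Comm_split (again, names like make_file_comm, ri, and rj are illustrative):

/* Sketch: carve the 150x200 rank grid into one communicator per 3D file,
 * 15x20 ranks each -> 100 files. Assumes a row-major rank layout; adapt
 * to however the model actually arranges ranks. */
#include <mpi.h>

MPI_Comm make_file_comm(int nrank_x, int nrank_y,      /* 150, 200 */
                        int ranks_per_file_x,          /* 15       */
                        int ranks_per_file_y)          /* 20       */
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Position of this rank in the 150x200 decomposition */
    int ri = world_rank / nrank_y;
    int rj = world_rank % nrank_y;

    /* All ranks in the same 15x20 block share one color, hence one file */
    int color = (ri / ranks_per_file_x) * (nrank_y / ranks_per_file_y)
              + (rj / ranks_per_file_y);

    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &file_comm);
    return file_comm;   /* pass this to H5Pset_fapl_mpio for the 3D file */
}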
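Finally, the gather-and-reassemble idea from question 1, sketched with illustrative names; the reordering of the gathered patches into a contiguous tile is omitted:

/* Sketch: gather the 22x15 patches within a group communicator onto its
 * rank 0, so only those aggregator ranks would touch the 2D file. */
#include <mpi.h>
#include <stdlib.h>

void gather_patches(MPI_Comm group_comm, float *patch, int patch_elems,
                    float **gathered /* out: valid on group rank 0 only */)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    *gathered = NULL;
    if (grank == 0)
        *gathered = malloc((size_t)gsize * patch_elems * sizeof(float));

    /* Rank 0 of each group receives all patches in group-rank order; it
     * would then reorder them into a contiguous tile before writing. */
    MPI_Gather(patch, patch_elems, MPI_FLOAT,
               *gathered, patch_elems, MPI_FLOAT, 0, group_comm);

    /* The aggregators themselves would be grouped into their own
     * communicator, e.g.
     *   MPI_Comm_split(MPI_COMM_WORLD, grank == 0 ? 0 : MPI_UNDEFINED,
     *                  0, &agg_comm);
     * and agg_comm passed to H5Pset_fapl_mpio when opening the 2D file,
     * so far fewer ranks access it at once. */
}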
Since I am unsure where the bottleneck is, I'm asking the HDF5 list first; as I understand it, some of the folks here are familiar with the Kraken resource and have used parallel HDF5 with very large numbers of ranks. Any tips or suggestions for how to wrestle with this problem are greatly appreciated.

Thanks,

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
