This is somewhat related to my earlier queries about pHDF5 performance on a Lustre filesystem, but different enough to merit a new thread.
The cloud model I am using outputs two main types of 3D data: snapshots and restart files. The snapshot files are used for post-processing, visualization, analysis, etc., and for those I am using pHDF5 to write a few (maybe even a single) files containing the data from all 30,000 ranks; I am still testing this. Restart files (sometimes called checkpoint files), which contain the minimum amount of data required to start the model from a particular state in time, are disposable, so my main concern is that they be written and read as quickly as possible; I don't care what format they are in. Because of ghost zones and other factors that make arrays overlap in model space, it is much simpler to have each rank write its own restart file than to merge them with pHDF5. I started down the pHDF5 route and decided it wasn't worth it.

Currently, the model defaults to one file per rank, written with unformatted Fortran writes, i.e.:

open(unit=50,file=trim(filename),form='unformatted',status='unknown')
write(50) ua
write(50) va
write(50) wa
write(50) ppi
write(50) tha

etc., where each array is 3D (some 1D arrays are written as well). With my current configuration, each restart file is approximately 12 MB.

After reading what literature I could find on Lustre, I decided to write no more than 3,000 files at one time, keep 3,000 files per directory, and set the stripe count (the number of OSTs to stripe over) to 1. I set the stripe size to 32 MB which, in retrospect, was probably not an ideal choice given the restart file size. With this configuration I wrote 353 GB of data, spanning 30,000 files, in about 6 minutes, for an effective write bandwidth of 1.09 GB/s. A second try got 2.8 GB/s for no obvious reason. This is still much lower than the ~10 GB/s maximum (presumably aligned) performance quoted at http://www.nics.tennessee.edu/io-tips. I am assuming I am getting less than optimal results primarily because the writes are unaligned.

This brings me to my question: should I bother rewriting the checkpoint writing/reading code with HDF5 in order to increase performance? I understand that pHDF5 with collective I/O automatically does aligned writes, presumably by detecting the stripe size on the Lustre filesystem (is this true?). With serial HDF5, I see there is an H5Pset_alignment call, and I assume I would need to set the alignment manually, since it defaults to unaligned writes. Would I benefit from using H5Pset_alignment to match the stripe size on the Lustre filesystem?

My arrays are roughly 28x21x330, with some slight variation. Sixteen of these 4-byte floating-point arrays are written, giving approximately 12 MB per file. So, as a rough guess, I am thinking of trying the following:

- Set the stripe size to 4 MB (4194304 bytes).
- Try something like H5Pset_alignment(fapl, 1000, 4194304). (I didn't set the second argument to 0 because I really don't want to align the 4-byte integers, etc., that comprise some of the restart data, right?)
- Chunk in Z only, so my chunk dimensions would be something like 28x21x30 (it has never been clear to me what chunk size to pick to optimize I/O).
- Keep the other parameters the same (stripe count of 1, and 3,000 files per directory).

I guess what I'm mostly looking for is assurance that I will get faster I/O going down this kind of route than with the unformatted I/O I am doing now.
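To make the question concrete, here is a rough, untested sketch of the serial HDF5 write I have in mind (Fortran API). Only one of the sixteen arrays is shown, the file name and array contents are just placeholders, and the alignment/chunk values are the ones I proposed above:

! Sketch of a per-rank restart write with serial HDF5.
! The 4 MB alignment matches the Lustre stripe size I plan to set;
! chunking is 28x21x30, i.e. in Z only.
program restart_sketch
  use hdf5
  implicit none

  integer, parameter :: nx=28, ny=21, nz=330
  real, dimension(nx,ny,nz) :: ua
  character(len=256) :: filename
  integer(hid_t)   :: fapl, dcpl, file_id, space_id, dset_id
  integer(hsize_t) :: dims(3)  = (/ nx, ny, nz /)
  integer(hsize_t) :: cdims(3) = (/ 28, 21, 30 /)
  integer(hsize_t) :: thresh, align
  integer :: ierr

  ua = 0.0                          ! stand-in for actual model state
  filename = 'restart_000000.h5'    ! placeholder; real name would encode the rank
  thresh = 1000                     ! only align objects >= 1000 bytes
  align  = 4194304                  ! 4 MB, the proposed stripe size

  call h5open_f(ierr)

  ! File access property list with alignment set to the stripe size
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, ierr)
  call h5pset_alignment_f(fapl, thresh, align, ierr)

  call h5fcreate_f(trim(filename), H5F_ACC_TRUNC_F, file_id, ierr, &
                   access_prp=fapl)

  ! Dataset creation property list with the proposed chunking
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
  call h5pset_chunk_f(dcpl, 3, cdims, ierr)

  ! One dataset shown; the other fifteen arrays would be written the same way
  call h5screate_simple_f(3, dims, space_id, ierr)
  call h5dcreate_f(file_id, 'ua', H5T_NATIVE_REAL, space_id, dset_id, &
                   ierr, dcpl_id=dcpl)
  call h5dwrite_f(dset_id, H5T_NATIVE_REAL, ua, dims, ierr)

  call h5dclose_f(dset_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5pclose_f(dcpl, ierr)
  call h5pclose_f(fapl, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program restart_sketch

The striping itself would be set on the restart directories beforehand, with something like "lfs setstripe -c 1 -S 4m <dirname>" (if I have the lfs syntax right for my Lustre version).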
Thanks as always,

Leigh

--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
