Hi Leigh, I'm not sure of the origin of the "3000" limit, but it is
true that if you are going to write file-per-processor on Lustre, it
can be much better to use a "sqrt(n)" file layout, meaning you
create sqrt(n) directories and fill each with sqrt(n) files. This
tends to alleviate pressure on the metadata server, which is often the
bottleneck when working with large numbers of files on a Lustre filesystem.
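
For what it's worth, here is a rough sketch of the kind of thing I
mean. The directory/file naming and the mkdir via execute_command_line
are just illustrative choices on my part, not anything from your model:

  program sqrt_layout
    use mpi
    implicit none
    integer :: ierr, myrank, nprocs, ndirs, dirid
    character(len=64) :: dirname, filename

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! roughly sqrt(n) directories, each holding roughly sqrt(n) files
    ndirs = ceiling(sqrt(real(nprocs)))
    dirid = myrank / ndirs

    write(dirname,'(a,i5.5)') 'restart_d', dirid
    write(filename,'(2a,i6.6,a)') trim(dirname), '/restart_', myrank, '.dat'

    ! the first rank mapped to each directory creates it (Fortran 2008)
    if (mod(myrank, ndirs) == 0) &
         call execute_command_line('mkdir -p '//trim(dirname))
    call MPI_Barrier(MPI_COMM_WORLD, ierr)

    ! ... each rank then opens trim(filename) and writes its arrays ...

    call MPI_Finalize(ierr)
  end program sqrt_layout

With 30,000 ranks that works out to roughly 174 files in each of
roughly 174 directories, which keeps any single directory (and the
metadata server) from being hammered.
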
Mark

On Fri, Mar 4, 2011 at 4:48 PM, Leigh Orf <[email protected]> wrote:
> This is somewhat related to my earlier queries about pHDF5 performance
> on a lustre filesystem, but different enough to merit a new thread.
>
> The cloud model I am using outputs two main types of 3D data:
> snapshots and restart files.
>
> The snapshot files are used for post-processing, visualization,
> analysis, etc., and I am using pHDF5 to write a few files (maybe even
> just one) containing data from all 30,000 ranks. I am still working on
> testing this.
>
> However, restart files (sometimes called checkpoint files), which
> contain the minimum amount of data required to restart the model from
> a particular state in time, are disposable. Hence my main concern with
> restart files is that they be written and read as quickly as possible;
> I don't care what format they're in.
>
> Because of issues with ghost zones and other factors which make arrays
> overlap in model space, it is much simpler to have each rank write its
> own restart file rather than trying to merge them together using
> pHDF5. I started going down the pHDF5 route and decided it wasn't
> worth it.
>
> Currently, the model defaults to one file per rank, but uses
> unformatted Fortran writes, i.e.:
>
> open(unit=50,file=trim(filename),form='unformatted',status='unknown')
>       write(50) ua
>       write(50) va
>       write(50) wa
>       write(50) ppi
>       write(50) tha
>
> etc., where each array is 3D (some 1D arrays are written as well).
>
> With the current configuration I have, each restart file is
> approximately 12 MB.
>
> After reading through what literature I could find on Lustre, I
> decided that I would write no more than 3,000 files at one time, have
> 3,000 files per unique directory, and that I would set the stripe
> count (the number of OSTs to stripe over) to 1. I set the stripe size
> to 32 MB which, in retrospect, was probably not an ideal choice given
> the size of each restart file.
>
> With this configuration, I wrote 353 GB of data, spanning 30,000
> files, in about 6 minutes, for an effective write bandwidth of
> 1.09 GB/s. A second try got better performance, 2.8 GB/s, for no
> obvious reason. However, this is still much lower than the ~10 GB/s
> maximum (presumably aligned) performance expected according to
> http://www.nics.tennessee.edu/io-tips.
>
> I am assuming I am getting less than optimal results primarily because
> the writes are unaligned. This brings me to my question: Should I
> bother to rewrite the checkpoint writing/reading code using HDF5 in
> order to increase performance? I understand that with pHDF5 and
> collective I/O, writes are automatically aligned, presumably because it
> can detect the stripe size on the Lustre filesystem (is this true?).
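>
> (For reference, by "collective I/O" I mean setting up the property
> lists roughly like this; an untested sketch with my own subroutine
> name, not code from the model:
>
>   subroutine make_collective_props(fapl_id, dxpl_id)
>     use hdf5
>     use mpi
>     implicit none
>     integer(hid_t), intent(out) :: fapl_id, dxpl_id
>     integer :: hdferr
>     ! file access property list: MPI-IO file driver
>     call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
>     call h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
>     ! dataset transfer property list: collective writes
>     call h5pcreate_f(H5P_DATASET_XFER_F, dxpl_id, hdferr)
>     call h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_COLLECTIVE_F, hdferr)
>   end subroutine make_collective_props
>
> and then passing dxpl_id as the xfer_prp argument to h5dwrite_f.)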
>
> With serial HDF5, I see there is an H5Pset_alignment call. I also
> assume that with serial HDF5 I would need to set the alignment
> manually, as it defaults to unaligned writes. Would I benefit from
> using H5Pset_alignment to match the stripe size on the Lustre
> filesystem?
>
> My arrays are roughly 28x21x330, with some slight variation. Sixteen of
> these 4-byte floating-point arrays are written (28 x 21 x 330 x 16 x 4
> bytes), giving approximately 12 MB per file.
>
> So, as a rough guess, I am thinking of trying the following:
>
> Set stripe size to 4 MB (4194304 bytes)
>
> Try something like:
>
> H5Pset_alignment(fapl, 1000, 4194304)
>
> (I didn't set the second argument, the threshold, to 0 because I really
> don't want to align the 4-byte integers, etc., that comprise some of the
> restart data, right?)
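>
> In the model's Fortran, I imagine the file-open part would look
> something like the sketch below (h5pset_alignment_f being the Fortran
> wrapper for H5Pset_alignment; the subroutine and variable names are
> just mine, untested):
>
>   subroutine open_restart_h5(filename, file_id, fapl_id)
>     use hdf5
>     implicit none
>     character(len=*), intent(in)  :: filename
>     integer(hid_t),   intent(out) :: file_id, fapl_id
>     integer(hsize_t) :: align_thresh, align_bytes
>     integer :: hdferr
>
>     call h5open_f(hdferr)
>     call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
>     ! align any object of 1000 bytes or more on 4 MB boundaries
>     ! (matching the 4 MB Lustre stripe size)
>     align_thresh = 1000
>     align_bytes  = 4194304
>     call h5pset_alignment_f(fapl_id, align_thresh, align_bytes, hdferr)
>     call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, hdferr, &
>                      access_prp=fapl_id)
>   end subroutine open_restart_h5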
>
> Chunk in Z only, so my chunk dimensions would be something like
> 28x21x30 (it's never been clear to me what chunk size to pick to
> optimize I/O).
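>
> For each 3D array, I imagine the chunked dataset creation would then
> look roughly like this (continuing from the file_id sketch above;
> again my own names, untested):
>
>   subroutine write_restart_array(file_id, name, a)
>     use hdf5
>     implicit none
>     integer(hid_t),   intent(in) :: file_id
>     character(len=*), intent(in) :: name
>     real,             intent(in) :: a(:,:,:)
>     integer(hid_t)   :: space_id, dcpl_id, dset_id
>     integer(hsize_t) :: dims(3), chunk_dims(3)
>     integer :: hdferr
>
>     dims = shape(a)                                  ! e.g. 28 x 21 x 330
>     chunk_dims = (/ dims(1), dims(2), 30_hsize_t /)  ! chunk in Z only
>
>     call h5screate_simple_f(3, dims, space_id, hdferr)
>     call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, hdferr)
>     call h5pset_chunk_f(dcpl_id, 3, chunk_dims, hdferr)
>     call h5dcreate_f(file_id, name, H5T_NATIVE_REAL, space_id, &
>                      dset_id, hdferr, dcpl_id)
>     call h5dwrite_f(dset_id, H5T_NATIVE_REAL, a, dims, hdferr)
>
>     call h5dclose_f(dset_id, hdferr)
>     call h5pclose_f(dcpl_id, hdferr)
>     call h5sclose_f(space_id, hdferr)
>   end subroutine write_restart_array
>
> and then call it once per array (ua, va, wa, ppi, tha, and so on).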
>
> And keep the other parameters the same (1 stripe, and 3,000 files per
> directory).
>
> I guess what I'm mostly looking for is assurance that I will get
> faster I/O going down this kind of route than I do with the current
> unformatted I/O.
>
> Thanks as always,
>
> Leigh
>
> --
> Leigh Orf
> Associate Professor of Atmospheric Science
> Department of Geology and Meteorology
> Central Michigan University
> Currently on sabbatical at the National Center for Atmospheric
> Research in Boulder, CO
> NCAR office phone: (303) 497-8200
>

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
