Hi Leigh,

On Mar 4, 2011, at 3:48 PM, Leigh Orf wrote:

> This is somewhat related to my earlier queries about pHDF5 performance
> on a Lustre filesystem, but different enough to merit a new thread.
> 
> The cloud model I am using outputs two main types of 3D data:
> snapshots and restart files.
> 
> The snapshot files are used for post-processing, visualization,
> analysis, etc., and I am using pHDF5 to write a small number of files
> (maybe even just one) containing data from all 30,000 ranks. I am still
> working on testing this.
> 
> However, restart files (sometimes called checkpoint files), which
> contain the minimum amount of data required to start the model up from
> a particular state in time, are disposable and hence my main concern
> with restart files is that they be written and read as quickly as possible.
> But I don't care what format they're in.
> 
> Because of issues with ghost zones and other factors which make arrays
> overlap in model space, it is much simpler to have each rank write its
> own restart file rather than trying to merge them together using
> pHDF5. I started going down the pHDF5 route and decided it wasn't
> worth it.
> 
> Currently, the model defaults to one file per rank, but uses
> unformatted Fortran writes, i.e.:
> 
> open(unit=50,file=trim(filename),form='unformatted',status='unknown')
>       write(50) ua
>       write(50) va
>       write(50) wa
>       write(50) ppi
>       write(50) tha
> 
> etc. where each array is 3d (some 1d arrays are written as well).

        Yow!  Very 70's... :-)

> With the current configuration I have, each restart file is approximately 
> 12 MB.
> 
> After reading through what literature I could find on Lustre, I
> decided that I would write no more than 3,000 files at one time, have
> 3,000 files per unique directory, and that I would set the stripe
> count (the number of OSTs to stripe over) to 1. I set the stripe size
> to 32 MB which, in retrospect, was probably not an ideal choice given
> the size of each restart file.
> 
> With this configuration, I wrote 353 GB of data, spanning 30,000
> files, in about 6 minutes, getting an effective write bandwidth of
> 1.09 GB/s. A second try got better performance for no obvious reason,
> reaching 2.8 GB/s. However, this is still much lower than the ~10 GB/s
> that, according to http://www.nics.tennessee.edu/io-tips, is the maximum
> expected (presumably aligned) performance.
> 
> I am assuming I am getting less than optimal results primarily because
> the writes are unaligned. This brings me to my question: Should I
> bother to rewrite the checkpoint writing/reading code using HDF5 in
> order to increase performance? I understand that with pHDF5 and collective
> I/O, it automatically does aligned writes, presumably by detecting the
> stripe size on the Lustre filesystem (is this true?).

        Unless the MPI-IO layer is doing this, HDF5 doesn't do this by default.
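
        For the pHDF5 (snapshot) case, you can pass Lustre striping hints down
to the MPI-IO layer yourself through the info object on the file access
property list.  Here's a rough, untested C sketch of what I mean;
"striping_factor" and "striping_unit" are the reserved MPI-IO hint names, but
whether they actually take effect depends on your MPI implementation's Lustre
driver, and the function and argument names are just placeholders:

#include <mpi.h>
#include <hdf5.h>

/* Create a parallel HDF5 file, passing Lustre striping hints to MPI-IO.
 * The hints only matter at file-creation time, since that is when the
 * stripe layout of the file is fixed. */
hid_t create_phdf5_file(const char *filename, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");       /* number of OSTs */
    MPI_Info_set(info, "striping_unit", "4194304");   /* 4 MB stripe size */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}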

> With serial HDF5, I see there is an H5Pset_alignment call. I also
> assume that with serial HDF5, I would need to manually set the
> alignment, as it defaults to unaligned writes. Would I benefit from
> using H5Pset_alignment to match the stripe size on the Lustre filesystem?

        Yes, almost certainly.  This is one of the ways that Mark Howison and I 
worked out to improve the I/O performance on the NERSC machines.

> My arrays are roughly 28x21x330 with some slight variation. Sixteen of
> these 4-byte floating-point arrays are written, giving approximately 12 MB
> per file.
> 
> So, as a rough guess, I am thinking of trying the following:
> 
> Set stripe size to 4 MB (4194304 bytes)
> 
> Try something like:
> 
> H5Pset_alignment(fapl, 1000, 4194304)
> 
> (I didn't set the second argument to 0 because I really don't want to
> align my 4-byte integers, etc., that comprise some of the restart data,
> right?)

        Yes, that looks fine.
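
        For reference, here's a minimal, untested C sketch of what I'd expect
the property-list setup to look like for each per-rank restart file (the
function and file names are just placeholders):

#include <hdf5.h>

/* Create a serial HDF5 restart file with the alignment matched to the
 * Lustre stripe size: objects of 1000 bytes or more get placed on 4 MB
 * boundaries, while small metadata is left alone. */
hid_t create_restart_file(const char *filename)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_alignment(fapl, 1000, 4194304);   /* threshold, alignment (bytes) */

    hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}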

> Chunk in Z only, so my chunk dimensions would be something like
> 28x21x30 (it's never been clear to me what chunk size to pick to
> optimize I/O).
> 
> And keep the other parameters the same (1 stripe, and 3,000 files per
> directory).
> 
> I guess what I'm mostly looking for is assurance that I will get
> faster I/O going down this kind of route than with the way I am currently
> doing unformatted I/O.

        This looks like a fruitful direction to go in.  Do you really need
chunking, though?
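
        If you don't need partial reads, compression, or extensible datasets,
contiguous (unchunked) storage is the HDF5 default and is the simplest thing
to try first.  Here's a rough, untested C sketch of one array write under that
assumption; the function and dataset names are placeholders, and note that the
dims are given in C row-major order (a Fortran (28,21,330) array becomes
{330, 21, 28} here):

#include <hdf5.h>

/* Write one 3-D float array with the default contiguous (unchunked)
 * layout.  The caller passes an already-created file handle. */
void write_restart_array(hid_t file, const char *name, const float *data,
                         hsize_t nx, hsize_t ny, hsize_t nz)
{
    hsize_t dims[3] = {nz, ny, nx};   /* e.g. {330, 21, 28} in C order */
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(file, name, H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset);
    H5Sclose(space);
}

Called once per array (ua, va, wa, ppi, tha, ...), that should give you the
same per-file content as the unformatted writes, just aligned.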

        Quincey

