Marty Barnaby wrote:
Nathaniel Rutman wrote:
Marty Barnaby wrote:
I'm attempting to establish an absolute maximum byte-rate performance
value by running a bare-bones MPI_File_write_at_all benchmark program
on our Cray XT3 installation, RedStorm, here at Sandia National
Laboratories. Processor time is at a premium, and I only run in the
standard queue, so I'm not able to do everything I would imagine,
though maybe what I can run is adequate.
I have a directory under our Lustre, redstorm:/scratch_grande which
I have defined with:
lfs setstripe -1 0 -1
Though there are 320 OSTs comprising the FS, these defaults give me
a stripe_count of 160 (I'm sure someone could explain that), and I
don't know the stripe_size. With a job of 160 processors, each of
which has a contiguous chunk of 20 MB of memory to append to an open
file in an iterative series of single, atomic write_all operations, I
can normally average 25 GB/s. To curb any confusion here, that
represents only an experimental maximum to me; none of our many
complex science and engineering simulation applications perform their
output dumping with per-processor blocks even as large as a single MB.
I would like any succinct suggestions on explicitly setting my lfs
stripe_size, given the configuration and parameters I've mentioned
here, to optimize it and perhaps see a decrease in the time spent
storing my data on the FS.
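In case it helps frame the numbers, the core of what I'm timing boils
down to something like the sketch below. This is not the actual
benchmark source; the record count, the file name, and the offset
arithmetic are simplified stand-ins for illustration.

/* Simplified sketch of the timed loop: each of nprocs ranks appends one
 * contiguous 20 MB record per iteration with a collective
 * MPI_File_write_at_all, with offsets chosen so the records pack the
 * file densely. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_SIZE (20L * 1024 * 1024)   /* 20 MB per rank per record */
#define NRECS       10                    /* iterations (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(RECORD_SIZE);      /* contiguous per-rank chunk */
    memset(buf, 'x', RECORD_SIZE);

    MPI_File_open(MPI_COMM_WORLD, "/scratch_grande/mlbarna/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    double t0 = MPI_Wtime();
    for (int rec = 0; rec < NRECS; rec++) {
        /* this rank's slot within record number 'rec' */
        MPI_Offset off = ((MPI_Offset)rec * nprocs + rank) * RECORD_SIZE;
        MPI_File_write_at_all(fh, off, buf, RECORD_SIZE, MPI_BYTE,
                              MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    MPI_File_close(&fh);
    if (rank == 0) {
        double gb = (double)NRECS * nprocs * RECORD_SIZE / 1.0e9;
        printf("%.2f GB in %.2f s = %.2f GB/s\n", gb, t1 - t0, gb / (t1 - t0));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}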
Try setting your stripe size to 20 MB. As Kalpak mentioned, we
currently have a limit of 160 OSTs for any one file (although, of
course, there are plans to remove this limitation soon).
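If you want the benchmark to control its own layout instead of relying
on directory defaults, the MPI-2 standard also reserves the
"striping_factor" and "striping_unit" info keys for MPI_File_open.
Whether the XT3's MPI-IO actually passes them through to Lustre is
implementation-dependent, so treat the sketch below as a guess and
verify the result with 'lfs getstripe -v' (the path is just
illustrative):

/* Minimal sketch: request the striping via the reserved MPI-2 info keys.
 * Lustre only applies a layout at file creation, so the file must not
 * already exist with a different layout. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_unit",   "20971520"); /* 20 MB stripes */
    MPI_Info_set(info, "striping_factor", "160");      /* 160 OSTs      */

    MPI_File_open(MPI_COMM_WORLD, "/scratch_grande/mlbarna/hinted_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}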
Would you mind posting your test prog? I can imagine others (besides
me) might be interested in such experimental maximums.
I made a new directory and set the parameters as you suggest. I
verified them by touching a file, then checking it with 'lfs getstripe
-v', which returned:
/scratch_grande/mlbarna/max.ss20MB/t
lmm_magic: 0x0BD10BD0
lmm_object_gr: 0
lmm_object_id: 0x501819f
lmm_stripe_count: 160
lmm_stripe_size: 20971520
lmm_stripe_pattern: 1
Since yesterday, I have gotten a dozen runs each for the default
stripe_size directory (which is 2 MB) and for the new 20 MB
stripe_size, with an allocation of 160 client processes. The 20 MB
stripe_size case was slower by 10%. Am I mistaken in my understanding
that, in the faster case, which I've run many times this week, each
individual per-processor write of 20 MB is distributed, in 2 MB
stripes, across 10 of the 160 OSTs in my file's stripe_count?
Correct
If so, and correct my math or my understanding if I am not seeing this
right, in one MPI_File_write_at_all operation each OST is responding to
write requests from 10 separate clients, and doing it faster than when
there is just one client, as there is when the stripe_size is set up at
20 MB.
Kind of a series vs parallel thing. 1 client writing to 10 OSTs 10
times in series, or 10 clients writing to 1 OST each in parallel. If
you're still tweaking, you might try numbers in between - 1MB stripe
size, 4MB, 8MB, 10MB. Might be fun to plot.
For my benchmark program, I am running a fairly simple item, left over
as a legacy from when the DOE ASCI program had a project to create a
standard data format. The package became known as SAF (Sets and
Fields), and the lower layers were HDF5 and, ultimately, MPI-IO in
collective _all calls. A few years ago, in a bid to increase
throughput, the HDF5 project implemented an optional POSIX virtual file
driver, because it was imagined that MPI-IO might be an impediment.
Rob Matzke authored the testing client, which he named 'rb', with some
tricks I like to leverage for the various approaches for which I have
an agenda, including the ability to choose the layer: HDF5 via either
MPI-IO or the POSIX virtual file driver, MPI-IO write_all directly, or
plain POSIX with a simple design for collective, strided writing. For
the main activity, it is a plain loop, iterating --nrecs <number> of
times with a --record <size> buffer per processor, calling the API
routine appropriate to the library level chosen.
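In rough terms, the plain-POSIX, strided idea is that every rank
computes its own file offset and writes its record there, with no
shared file pointer. Something like the sketch below; the offset
scheme, the use of pwrite, and the names are my illustration, not the
actual rb source.

/* Sketch of a plain-POSIX strided-record writer: each rank places its
 * record at a rank- and record-dependent offset, mirroring the layout
 * of the collective MPI-IO case. Error checking omitted. */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    long record = 20L * 1024 * 1024;   /* --record size, 20 MB here */
    int  nrecs  = 10;                  /* --nrecs                   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(record);
    int fd = open("/scratch_grande/mlbarna/posix_file",
                  O_WRONLY | O_CREAT, 0644);

    for (int rec = 0; rec < nrecs; rec++) {
        off_t off = ((off_t)rec * nprocs + rank) * record;
        pwrite(fd, buf, record, off);  /* strided; no shared file pointer */
    }

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}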
In my community, IOR is the most commonly accepted benchmark program.
Though rb isn't as transparent as some of the absolutely johnny-one-note
executables I have created for my own information in the past, I find
IOR overly convoluted. I have seen several cases
where people ran IOR without really understanding all the parameters
they were getting.
How would I post 'rb'? I could merely send you a .tar.gz?
If it's not huge, post it to the list? Or we could stick it on a wiki page.