Marty Barnaby wrote:
Nathaniel Rutman wrote:
Marty Barnaby wrote:
I'm attempting to establish an absolute maximum byte-rate performance
value by running a bare-bones MPI_File_write_at_all benchmark program
on our Cray XT3 installation, RedStorm, here at Sandia National
Laboratories. Processor time is at a premium, and I only run in the
standard queue, so I'm not able to do everything I would imagine,
though maybe what I can run is adequate.
I have a directory under our Lustre, redstorm:/scratch_grande which
I have defined with:
lfs setstripe -1 0 -1
Though there are 320 OSTs comprising the FS, these defaults give me
a stripe_count of 160 (I'm sure someone could explain that), and I
don't know the stripe_size. With a job of 160 processors, each of
which has a contiguous chunk of 20 MB of memory to append to an open
file in an iterative series of single, atomic write_all operations, I
can normally average 25 GB/s. To curb any confusion here, that
represents only an experimental maximum to me; none of our many
complex science and engineering simulation applications perform their
output dumping with per-processor blocks even as large as a single MB.
I would like any succinct suggestions on explicitly setting my lfs
stripe_size, given the configuration and parameters I've mentioned
here, to optimize it and perhaps see a decrease in the time spent
storing my data on the FS.
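In case it helps frame the numbers, the core of what I'm timing boils
down to something like the sketch below. This is not the actual
benchmark source; the record count, the file name, and the offset
arithmetic are simplified stand-ins for illustration.

/* Simplified sketch of the timed loop: each of nprocs ranks appends one
 * contiguous 20 MB record per iteration with a collective
 * MPI_File_write_at_all, with offsets chosen so the records pack the
 * file densely. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_SIZE (20L * 1024 * 1024)   /* 20 MB per rank per record */
#define NRECS       10                    /* iterations (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(RECORD_SIZE);      /* contiguous per-rank chunk */
    memset(buf, 'x', RECORD_SIZE);

    MPI_File_open(MPI_COMM_WORLD, "/scratch_grande/mlbarna/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    double t0 = MPI_Wtime();
    for (int rec = 0; rec < NRECS; rec++) {
        /* this rank's slot within record number 'rec' */
        MPI_Offset off = ((MPI_Offset)rec * nprocs + rank) * RECORD_SIZE;
        MPI_File_write_at_all(fh, off, buf, RECORD_SIZE, MPI_BYTE,
                              MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    MPI_File_close(&fh);
    if (rank == 0) {
        double gb = (double)NRECS * nprocs * RECORD_SIZE / 1.0e9;
        printf("%.2f GB in %.2f s = %.2f GB/s\n", gb, t1 - t0, gb / (t1 - t0));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}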
Try setting your stripe size to 20 MB. As Kalpak mentioned, we
currently have a limit of 160 OSTs for any one file (although, of
course, there are plans to remove this limitation soon).
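If you want the benchmark to control its own layout instead of relying
on directory defaults, the MPI-2 standard also reserves the
"striping_factor" and "striping_unit" info keys for MPI_File_open.
Whether the XT3's MPI-IO actually passes them through to Lustre is
implementation-dependent, so treat the sketch below as a guess and
verify the result with 'lfs getstripe -v' (the path is just
illustrative):

/* Minimal sketch: request the striping via the reserved MPI-2 info keys.
 * Lustre only applies a layout at file creation, so the file must not
 * already exist with a different layout. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_unit",   "20971520"); /* 20 MB stripes */
    MPI_Info_set(info, "striping_factor", "160");      /* 160 OSTs      */

    MPI_File_open(MPI_COMM_WORLD, "/scratch_grande/mlbarna/hinted_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}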
Would you mind posting your test prog? I can imagine others (besides
me) might be interested in such experimental maximums.
I made a new directory and set the parameters as you suggest. I
verified them by touching a file, then checking it with 'lfs getstripe
-v', which returned:
/scratch_grande/mlbarna/max.ss20MB/t
lmm_magic: 0x0BD10BD0
lmm_object_gr: 0
lmm_object_id: 0x501819f
lmm_stripe_count: 160
lmm_stripe_size: 20971520
lmm_stripe_pattern: 1
Since yesterday, I have gotten a dozen runs each for the default
stripe_size directory (which is 2 MB) and for the new 20 MB
stripe_size, with an allocation of 160 client processes. The 20 MB
stripe_size case was slower by 10%. Am I mistaken in my understanding
that, in the faster case, which I've run many times this week, each
individual per-processor write of 20 MB is distributed, in 2 MB
stripes, across 10 of the 160 OSTs in my file's stripe_count?
Correct
If so, and correct my math or my understanding if I am not seeing this
right, in one MPI_File_write_at_all operation each OST is responding to
write requests from 10 separate clients, and doing it faster than when
there is just one client, as there is when the stripe_size is set up at
20 MB.
Kind of a series vs parallel thing. 1 client writing to 10 OSTs 10
times in series, or 10 clients writing to 1 OST each in parallel. If
you're still tweaking, you might try numbers in between - 1MB stripe
size, 4MB, 8MB, 10MB. Might be fun to plot.
For my benchmark program, I am running a fairly simple item, left over
as a legacy from when the DOE ASCI program had a project to create a
standard data format. The package became known as SAF (Sets and
Fields), and the lower layers were HDF5 and, ultimately, MPI-IO in
collective _all calls. A few years ago, in a bid to increase
throughput, the HDF5 project implemented an optional POSIX virtual file
driver, because it was imagined that MPI-IO might be an impediment.
Rob Matzke authored the testing client, which he named 'rb', with some
tricks I like to leverage for the various approaches for which I have
an agenda, including the ability to choose the layer: HDF5 via either
MPI-IO or the POSIX virtual file driver, MPI-IO write_all directly, or
plain POSIX with a simple design for collective, strided writing. For
the main activity, it is a plain loop, iterating --nrecs <number> of
times with a --record <size> buffer per processor, calling the API
routine appropriate to the library level chosen.
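In rough terms, the plain-POSIX, strided idea is that every rank
computes its own file offset and writes its record there, with no
shared file pointer. Something like the sketch below; the offset
scheme, the use of pwrite, and the names are my illustration, not the
actual rb source.

/* Sketch of a plain-POSIX strided-record writer: each rank places its
 * record at a rank- and record-dependent offset, mirroring the layout
 * of the collective MPI-IO case. Error checking omitted. */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    long record = 20L * 1024 * 1024;   /* --record size, 20 MB here */
    int  nrecs  = 10;                  /* --nrecs                   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(record);
    int fd = open("/scratch_grande/mlbarna/posix_file",
                  O_WRONLY | O_CREAT, 0644);

    for (int rec = 0; rec < nrecs; rec++) {
        off_t off = ((off_t)rec * nprocs + rank) * record;
        pwrite(fd, buf, record, off);  /* strided; no shared file pointer */
    }

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}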
In my community, IOR is the most commonly accepted benchmark program.
Though rb isn't as transparent as some of the absolutely johnny-one-note
executables I have created for my own information in the past, I find
IOR overly convoluted. I have seen several cases
where people ran IOR without really understanding all the parameters
they were getting.
How would I post 'rb'? I could merely send you a .tar.gz?
If it's not huge, post it to the list? Or we could stick it on a wiki page.