On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
> Hi, everyone;
>
> I've been using MPI-IO on a Lustre file system to good effect for a
> while now in an application that has up to 32 processes writing to a
> shared file. However, seeking to understand the performance of our
> system, and improve on it, I've recently made some changes to the
> ADIO Lustre code, which show some promise, but need more testing.
> Eventually, I'd like to submit the code changes back to the mpich2
> project, but that is certainly contingent upon the results of
> testing (and various code compliance issues for mpich2/romio/adio
> that I will likely need to sort out.) This message is my request for
> volunteers to help test my code, in particular for output file
> correctness and shared-file write performance. If you're interested
> in doing shared file I/O using MPI-IO on Lustre, please continue
> reading this message.

Gosh, Martin, I really thought you'd get more attention with this post.
I'd like to see these patches: I can't aggressively test them on a
Lustre system, but I'd be happy to provide another set of ROMIO
eyeballs.

> In broad terms, the changes I made are on two fronts: changing the
> file domain partitioning algorithm, and introducing non-blocking
> operations at several points.

Non-blocking communication or I/O? One concern with non-blocking I/O
in this path is that the communication and I/O networks are often the
same thing (e.g. InfiniBand, or the BlueGene tree network in some
situations).

> The file domain partitioning algorithm
> that I implemented is from the paper "Dynamically Adapting File
> Domain Partitioning Methods for Collective I/O Based on Underlying
> Parallel File System Locking Protocols" by Wei-keng Liao and Alok
> Choudhary. The non-blocking operations that I added allow the ADIO
> Lustre driver better to parallelize the data exchange and writing
> procedures over multiple stripes within each process writing to one
> Lustre OST,

I was hoping Wei-keng would chime in on this. I'll be sure to draw his
attention to your patches.

> My testing so far has been limited to four nodes, up to sixteen
> processes, writing to shared files on a Lustre file system with up
> to eight OSTs.

Right now the only concern I have is that you may have traded better
small-scale performance for worse large-scale performance (without
looking at the code, I have no way of knowing).

> These tests were conducted to simulate the production
> application for which I'm responsible, but on a different cluster,
> focused only on the file output. In these rather limited tests, I've
> seen write performance gains of up to a factor of two or three.
> The new file domain partitioning algorithm is most effective when the
> number of processes exceeds the number of Lustre OSTs, but there are
> smaller gains in other cases, and I have not seen an instance in which
> the performance has decreased. As an example, in one case using
> sixteen processes, MPI over Infiniband, and a file striping factor
> of four, the new code achieves over 800 MB/s, whereas the standard
> code achieves 300 MB/s. I have hints that the relative performance
> gains when using a 1Gb Ethernet rather than Infiniband for MPI
> message passing are greater, but I have not completed my testing in
> that environment.
>
> If you're willing to try out this code in a test environment please
> let me know. I have not yet put the code into a publicly accessible
> repository, but will do so if there is interest out there.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
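
To make the partitioning question concrete, here is a minimal sketch of
why the way file domains are laid out matters when there are more
writers than OSTs. It is not Martin's patch and not ROMIO code. It
assumes a plain round-robin (RAID-0) stripe layout, and it compares a
contiguous, stripe-aligned split with a simple cyclic split in the
spirit of the cyclic methods discussed in the Liao and Choudhary paper.
All function names are hypothetical, and the sizes (16 aggregators, 4
OSTs) are borrowed from the example quoted above.

```python
from collections import defaultdict


def ost_of_stripe(stripe_index, stripe_count):
    """Assumed layout: stripes are placed round-robin over the file's OSTs."""
    return stripe_index % stripe_count


def contiguous_domains(num_stripes, num_aggregators):
    """Give each aggregator one contiguous, stripe-aligned block of stripes."""
    block = -(-num_stripes // num_aggregators)  # ceiling division
    domains = {}
    for agg in range(num_aggregators):
        first = agg * block
        last = min(first + block, num_stripes)
        domains[agg] = list(range(first, last))
    return domains


def cyclic_domains(num_stripes, num_aggregators):
    """Give stripe i to aggregator i % num_aggregators."""
    domains = defaultdict(list)
    for stripe in range(num_stripes):
        domains[stripe % num_aggregators].append(stripe)
    return dict(domains)


def osts_touched(domains, stripe_count):
    """For each aggregator, list the distinct OSTs its file domain lands on."""
    return {
        agg: sorted({ost_of_stripe(s, stripe_count) for s in stripes})
        for agg, stripes in domains.items()
    }


if __name__ == "__main__":
    num_stripes, num_aggregators, stripe_count = 64, 16, 4
    for name, split in (("contiguous", contiguous_domains),
                        ("cyclic", cyclic_domains)):
        touched = osts_touched(split(num_stripes, num_aggregators), stripe_count)
        worst = max(len(osts) for osts in touched.values())
        print(f"{name}: each aggregator writes to at most {worst} OST(s)")
```

Under these assumptions, when the number of aggregators is a multiple of
the stripe count, the cyclic split leaves each aggregator writing to a
single OST, while the contiguous split has every aggregator writing to
all of them. Whether that carries over to much larger process counts is
exactly the open question raised above.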
