Rob, I've been in email contact with Wei-keng Liao about the changes that I've made. We have mainly discussed my implementation of the new non-blocking code; it turns out that he is currently working on a very similar set of modifications. The file domain partitioning algorithm comes from Wei-keng, and I expect that the results he published should apply to my implementation, at least approximately. He has encouraged me to do some large-scale testing using XSEDE, and I have plans to run some benchmarks in that setting soon.
A few more comments, below.

Rob Latham wrote:
> On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
>> Hi, everyone;
>>
>> I've been using MPI-IO on a Lustre file system to good effect for a
>> while now in an application that has up to 32 processes writing to a
>> shared file. However, seeking to understand the performance of our
>> system, and improve on it, I've recently made some changes to the
>> ADIO Lustre code, which show some promise, but need more testing.
>> Eventually, I'd like to submit the code changes back to the mpich2
>> project, but that is certainly contingent upon the results of
>> testing (and various code compliance issues for mpich2/romio/adio
>> that I will likely need to sort out.) This message is my request for
>> volunteers to help test my code, in particular for output file
>> correctness and shared-file write performance. If you're interested
>> in doing shared file I/O using MPI-IO on Lustre, please continue
>> reading this message.
>
> Gosh, Martin, I really thought you'd get more attention with this
> post.

Me too.

> I'd like to see these patches: I can't aggressively test them on a
> lustre system but I'd be happy to provide another set of
> ROMIO-eyeballs.

I can send them to you now, if you want. However, I'm not finished testing and can't rule out further changes. I will go ahead and put them somewhere publicly accessible, and let everyone know when it's done.

>> In broad terms, the changes I made are on two fronts: changing the
>> file domain partitioning algorithm, and introducing non-blocking
>> operations at several points.
>
> Non-blocking communication or i/o ?
>
> One concern with non-blocking I/O in this path is that often the
> communication and I/O networks are the same thing (e.g. infiniband, or
> the BlueGene tree network in some situations).

Non-blocking in both communication and I/O. I was preparing a question to the Lustre discussion list about non-blocking I/O using the POSIX aio API.
I'll just ask right here, then. Is POSIX aio on a Lustre file system truly asynchronous? I expect that perhaps the implementation of aio in glibc may be asynchronous w.r.t. the calling thread, but I also wonder whether system calls to the Lustre client are asynchronous or not. Can anyone help me understand? I have a little data suggesting that the aio calls do improve performance a bit, but this is a tentative conclusion.

>> The file domain partitioning algorithm
>> that I implemented is from the paper "Dynamically Adapting File
>> Domain Partitioning Methods for Collective I/O Based on Underlying
>> Parallel File System Locking Protocols" by Wei-keng Liao and Alok
>> Choudhary. The non-blocking operations that I added allow the ADIO
>> Lustre driver better to parallelize the data exchange and writing
>> procedures over multiple stripes within each process writing to one
>> Lustre OST,
>
> I was hoping Wei-keng would chime in on this. I'll be sure to draw
> your patches to his attention.

I've already done that.

>> My testing so far has been limited to four nodes, up to sixteen
>> processes, writing to shared files on a Lustre file system with up
>> to eight OSTs.
>
> Right now the only concern I have is that you may have (and without
> looking at the code I have no way of knowing) traded better small-scale
> performance for worse large-scale performance.

Right. As I mentioned above, I will soon be testing my code in a large-scale setting.

>> These tests were conducted to simulate the production
>> application for which I'm responsible, but on a different cluster,
>> focused only on the file output. In these rather limited tests, I've
>> seen write performance gains of up to a factor of two or three. The
>> new file domain partitioning algorithm is most effective when the
>> number of processes exceeds the number of Lustre OSTs, but there are
>> smaller gains in other cases, and I have not seen an instance in which
>> the performance has decreased.
>> As an example, in one case using
>> sixteen processes, MPI over Infiniband, and a file striping factor
>> of four, the new code achieves over 800 MB/s, whereas the standard
>> code achieves 300 MB/s. I have hints that the relative performance
>> gains when using a 1Gb Ethernet rather than Infiniband for MPI
>> message passing are greater, but I have not completed my testing in
>> that environment.
>>
>> If you're willing to try out this code in a test environment please
>> let me know. I have not yet put the code into a publicly accessible
>> repository, but will do so if there is interest out there.
>
> ==rob
>

--
Martin
