Hi, everyone,

I've been using MPI-IO on a Lustre file system to good effect for a while now, in an application that has up to 32 processes writing to a shared file. However, in seeking to understand and improve the performance of our system, I've recently made some changes to the ADIO Lustre code. These show some promise but need more testing. Eventually I'd like to submit the code changes back to the mpich2 project, but that is contingent upon the results of testing (and on various code compliance issues for mpich2/romio/adio that I will likely need to sort out). This message is my request for volunteers to help test my code, in particular for output file correctness and shared-file write performance. If you're interested in doing shared-file I/O using MPI-IO on Lustre, please continue reading.
In broad terms, the changes I made are on two fronts: changing the file domain partitioning algorithm, and introducing non-blocking operations at several points. The file domain partitioning algorithm that I implemented is from the paper "Dynamically Adapting File Domain Partitioning Methods for Collective I/O Based on Underlying Parallel File System Locking Protocols" by Wei-keng Liao and Alok Choudhary. The non-blocking operations that I added allow the ADIO Lustre driver to better parallelize the data exchange and writing procedures over multiple stripes within each process writing to one Lustre OST.

My testing so far has been limited to four nodes, up to sixteen processes, writing to shared files on a Lustre file system with up to eight OSTs. These tests were conducted to simulate the production application for which I'm responsible, but on a different cluster, and focused only on the file output.

In these rather limited tests, I've seen write performance gains of up to a factor of two or three. The new file domain partitioning algorithm is most effective when the number of processes exceeds the number of Lustre OSTs, but there are smaller gains in other cases, and I have not seen an instance in which the performance has decreased. As an example, in one case using sixteen processes, MPI over InfiniBand, and a file striping factor of four, the new code achieves over 800 MB/s, whereas the standard code achieves 300 MB/s. I have hints that the relative performance gains are greater when using 1 Gb Ethernet rather than InfiniBand for MPI message passing, but I have not completed my testing in that environment.

If you're willing to try out this code in a test environment, please let me know. I have not yet put the code into a publicly accessible repository, but will do so if there is interest out there.
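For anyone who wants a concrete picture of the partitioning idea, here is a rough Python sketch of a stripe-aligned, group-cyclic style of assignment, in the spirit of the Liao and Choudhary paper. It is only an illustration and not the ADIO code itself. The function names (`stripe_owner`, `osts_touched`) are made up for this example, and it assumes that stripe i of the file lives on OST i mod (number of OSTs), that every process takes part in writing, and that the static-cyclic fallback is an acceptable simplification when the process count is not a multiple of the OST count.

```python
def stripe_owner(stripe_index, num_stripes, num_procs, num_osts):
    """Return the rank that writes a given file stripe (illustrative only).

    Assumes round-robin striping: stripe i is stored on OST i % num_osts.
    """
    if num_procs <= num_osts or num_procs % num_osts != 0:
        # Static-cyclic: deal stripes out round-robin over the processes.
        # Used here as a simplifying fallback for the awkward cases.
        return stripe_index % num_procs

    # Group-cyclic: split the processes into groups of num_osts ranks and
    # give each group one contiguous region of the file.
    num_groups = num_procs // num_osts
    rounds = -(-num_stripes // num_osts)          # ceil: passes over all OSTs
    rounds_per_group = -(-rounds // num_groups)   # ceil: passes per group
    stripes_per_group = rounds_per_group * num_osts

    group = stripe_index // stripes_per_group
    # Within a group, the rank is chosen by the OST that holds the stripe,
    # so each rank only ever writes to one OST.
    return group * num_osts + (stripe_index % num_osts)


def osts_touched(num_stripes, num_procs, num_osts):
    """Map each rank to the set of OST indices it would write to."""
    touched = {}
    for i in range(num_stripes):
        rank = stripe_owner(i, num_stripes, num_procs, num_osts)
        touched.setdefault(rank, set()).add(i % num_osts)
    return touched


if __name__ == "__main__":
    # Sixteen processes and a striping factor of four, as in the example
    # above; the stripe count of 64 is arbitrary.
    for rank, osts in sorted(osts_touched(64, 16, 4).items()):
        print(rank, sorted(osts))
```

With sixteen processes and four OSTs, each rank in this sketch ends up with a contiguous run of stripes on a single OST rather than stripes interleaved with those of other ranks on the same OST, which is the property that is meant to reduce lock contention.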
--
Martin Pokorny
Software Engineer - Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations
