> On May 19, 2015, at 1:44 PM, Schneider, David A. <[email protected]> wrote:
>
> Thanks for the suggestion! When I had each rank run on a separate compute
> node/host, I saw parallel performance (4 seconds for the 6 GB of writing).
> When I ran the MPI job on one host (the hosts have 12 cores; by default we
> pack ranks onto as few hosts as possible), things happened serially: each
> rank finished about 2 seconds after a different rank.
Hmm. That does seem like there is some bottleneck on the client side that is limiting the throughput from a single client. Here are some things you could look into (although they might require more tinkering than you have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are running IP-over-IB. Can you configure the clients to use RDMA? (They would have NIDs like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version? Earlier Lustre versions used a single-threaded ptlrpcd to manage network traffic, but newer versions have a multi-threaded implementation. You may need to check compatibility with the Lustre version running on the servers, though.

3) Do you have checksums disabled? Try running "lctl get_param osc.*.checksums". If the values are "1", then checksums are enabled, which can slow down performance. You could try setting the value to "0" to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
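[P.S. As a quick sketch of the checksum check from item 3 above — this assumes you are on a Lustre client with permission to run lctl (set_param generally needs root), and that the osc.*.checksums tunable exists on your Lustre version:]

```shell
# Show the current checksum setting for every OSC device on this client.
# Output lines look like: osc.<fsname>-OST0000-osc-<uuid>.checksums=1
lctl get_param osc.*.checksums

# If the values are 1, checksums are enabled. To disable them for a test
# (requires root; this change does not persist across remounts):
lctl set_param osc.*.checksums=0

# Re-run your MPI write test, then re-enable checksums afterwards:
lctl set_param osc.*.checksums=1
```

[Note this is only a temporary toggle for benchmarking; disabling checksums trades data-integrity checking for throughput, so you probably want them back on for production use.]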
