David,

You are right, there is a lock. As Patrick mentioned, https://jira.hpdd.intel.com/browse/LU-1669 will solve your problems. Please check it out.
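A quick way to confirm which Lustre version a client node is actually running, and to see the settings Rick asks about further down, is something like the following minimal sketch (standard lctl commands quoted in this thread; exact output paths can differ between releases):

    # Lustre client version on this node
    lctl get_param version

    # Client NIDs: x.x.x.x@tcp usually means IP-over-IB here, x.x.x.x@o2ib means native RDMA
    lctl list_nids

    # Client-side checksums: 1 = enabled, 0 = disabled
    lctl get_param osc.*.checksums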
In my own experience, the Lustre 2.7.0 client does solve this problem very well, and I have gotten very good performance so far.

Regards,
Cuong

On Wed, May 20, 2015 at 4:46 AM, David A. Schneider <[email protected]> wrote:
> We do use checksums, but can't turn them off. I know we've measured some performance penalty with checksums. I'll check about configuring the Lustre clients to use RDMA. We ran into something similar where our MPI programs were not taking advantage of the InfiniBand (we noticed much slower message passing than we expected), so it sounds like there is a similar thing we can do with Lustre, but I guess the locking is the main issue. All our compute nodes are currently running Red Hat 5, and it doesn't look like Lustre 2.6 was tested with RHEL 5, but we have been talking about moving everything to at least RHEL 6, maybe RHEL 7, so there's hope. Thanks for the help!
>
> best,
>
> David
>
>
> On 05/19/15 11:10, Patrick Farrell wrote:
>> Ah. I think I know what's going on here:
>>
>> In Lustre 2.x client versions prior to 2.6, only one process on a given client can write to a given file at a time, regardless of how the file is striped. So if you are writing to the same file, there will be little to no benefit of putting an extra process on the same node.
>>
>> A *single* process on a node could benefit, but not the split you've described.
>>
>> The details, which are essentially just that a pair of per-file locks are used by any individual process writing to a file, are here:
>> https://jira.hpdd.intel.com/browse/LU-1669
>>
>>
>> On 5/19/15, 12:59 PM, "Mohr Jr, Richard Frank (Rick Mohr)" <[email protected]> wrote:
>>
>>> On May 19, 2015, at 1:44 PM, Schneider, David A. <[email protected]> wrote:
>>>>
>>>> Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6 GB of writing). When I ran the MPI job on one host (the hosts have 12 cores; by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after a different rank.
>>>>
>>> Hmm. That does seem like there is some bottleneck on the client side that is limiting the throughput from a single client. Here are some things you could look into (although they might require more tinkering than you have permission to do):
>>>
>>> 1) Based on your output from "lctl list_nids", it looks like you are running IP-over-IB. Can you configure the clients to use RDMA? (They would have nids like x.x.x.x@o2ib.)
>>>
>>> 2) Do you have the option of trying a newer client version? Earlier Lustre versions used a single-threaded ptlrpcd to manage network traffic, but newer versions have a multi-threaded implementation. You may need to check compatibility with the Lustre version running on the servers, though.
>>>
>>> 3) Do you have checksums disabled? Try running "lctl get_param osc.*.checksums". If the values are "1", then checksums are enabled, which can slow down performance. You could try setting the value to "0" to see if that helps.
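A minimal sketch of trying point 3 above on a client (the setting is not persistent across a remount, so treat it purely as a test; the file path below is only a placeholder):

    # Temporarily disable client-side checksums to measure the impact
    lctl set_param osc.*.checksums=0

    # ... rerun the write test, then re-enable them
    lctl set_param osc.*.checksums=1

    # Also worth checking how the shared output file is striped (path is an example)
    lfs getstripe /path/to/shared_output_file

Switching the clients to native RDMA (point 1) is a bigger change; it usually means pointing LNet at the IB interface on the clients, for example an illustrative line like options lnet networks="o2ib(ib0)" in the Lustre modprobe configuration, and it has to match how the servers' networks are set up.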
>>>
>>> --
>>> Rick Mohr
>>> Senior HPC System Administrator
>>> National Institute for Computational Sciences
>>> http://www.nics.tennessee.edu

--
Nguyen Viet Cuong
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
