On 10/18/2010 2:58 PM, John White wrote:
> On Oct 18, 2010, at 10:49 AM, Peter Kjellstrom wrote:
>
>> On Monday 18 October 2010, John White wrote:
>>> Hello Folks,
>>> A while back (say 3 weeks ago) we started noticing extremely high loads
>>> (load avg around 300 at times) on our OSSs when in production and serving
>>> IO. This cluster was, at the time, on 1.8.2 (we have since upgraded to
>>> 1.8.4 but the problem remains). The load increases fairly predictably as
>>> clients generate IO but even 2 clients can produce a load avg above 5.00.
>> Does this impact performance or does it only show up as an unexpectedly high
>> number on the OSSes?
> We have gotten reports of scaling issues that we had not experienced prior to
> this issue cropping up. Throughput is certainly less predictable than before
> but we are able to hit the same peaks.
>
>> /Peter
>>
>>> An identical file system of ours does not exhibit this behavior (sticks
>>> below load avg 1.00 under even the heaviest IO load). I've looked around
>>> bugzilla and haven't found anything. We've disabled heartbeat on the
>>> off-chance that was generating the load (it's not), we've attempted using a
>>> different client transport (o2ib->tcp), this did not solve the issue.
>>> There doesn't appear to be any specific non-kernel thread causing the
>>> high-load. The only info in dmesg/syslog pertains to sporadic client
>>> evictions or sporadic slow setattr due to heavy IO load (we've since tuned
>>> the number of OST threads). We're basically out of ideas to try.
>>>
>>> As reference, this is a 1 MDS/4 OSS cluster backed by a DDN 9900 couplet
>>> (15 tiers, 1:1 lun mapping) running the lustre.org rpm build kernel for
>>> 1.8.4. The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000. Is this
>>> a common problem or should a bug be filed? Any info available upon
>>> request. Thanks for your time.
>>> ----------------
>>> John White
>>> High Performance Computing Services (HPCS)
>>> (510) 486-7307
>>> One Cyclotron Rd, MS: 50B-3209C
>>> Lawrence Berkeley National Lab
>>> Berkeley, CA 94720
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss

You should examine your kernel I/O scheduler. The deadline scheduler
sometimes helps in these kinds of circumstances.
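
For anyone who wants to check this quickly on each OSS, here is a minimal Python sketch that reports which scheduler every block device is currently using. It relies on the standard sysfs file /sys/block/<device>/queue/scheduler, where the kernel marks the active scheduler with square brackets. The function names parse_active_scheduler and report are my own, made up for illustration, and the sample scheduler list in the comments is only a typical example, not output from this cluster.

```python
import glob
import os


def parse_active_scheduler(text):
    """Parse the contents of a queue/scheduler sysfs file.

    The kernel lists the available schedulers on one line and wraps the
    active one in square brackets, e.g. "noop anticipatory deadline [cfq]".
    Returns (active, available); active is None if nothing is bracketed.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def report(sysfs_glob="/sys/block/*/queue/scheduler"):
    """Print the active and available schedulers for each block device."""
    for path in sorted(glob.glob(sysfs_glob)):
        # path looks like /sys/block/<device>/queue/scheduler
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as f:
            active, available = parse_active_scheduler(f.read())
        print(f"{device}: active={active} available={','.join(available)}")


if __name__ == "__main__":
    report()
```

If a device backing an OST is not on deadline, it can be switched at runtime as root by writing the scheduler name to the same sysfs file (echo deadline > /sys/block/<device>/queue/scheduler). That change does not survive a reboot, so treat it as a quick test rather than a permanent setting.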
~Lawrence
