On 10/18/2010 2:58 PM, John White wrote:
> On Oct 18, 2010, at 10:49 AM, Peter Kjellstrom wrote:
>
>> On Monday 18 October 2010, John White wrote:
>>> Hello Folks,
>>>     A while back (say 3 weeks ago) we started noticing extremely high loads
>>> (load avg around 300 at times) on our OSSs when in production and serving
>>> IO.  This cluster was, at the time, on 1.8.2 (we have since upgraded to
>>> 1.8.4 but the problem remains).  The load increases fairly predictably as
>>> clients generate IO but even 2 clients can produce a load avg above 5.00.
>> Does this impact performance or does it only show up as an unexpectedly high
>> number on the OSSes?
> We have gotten reports of scaling issues that we had not experienced prior to 
> this issue cropping up.  Throughput is certainly less predictable than before 
> but we are able to hit the same peaks.
>
>> /Peter
>>
>>> An identical file system of ours does not exhibit this behavior (sticks
>>> below load avg 1.00 under even the heaviest IO load).  I've looked around
>>> bugzilla and haven't found anything.  We've disabled heartbeat on the
>>> off-chance that was generating the load (it's not), and we've tried a
>>> different client transport (o2ib->tcp), which did not solve the issue
>>> either.  There doesn't appear to be any specific non-kernel thread causing
>>> the high load.  The only info in dmesg/syslog pertains to sporadic client
>>> evictions or sporadic slow setattr due to heavy IO load (we've since tuned
>>> the number of OST threads).  We're basically out of ideas to try.
>>>
>>> As reference, this is a 1 MDS/4 OSS cluster backed by a DDN 9900 couplet
>>> (15 tiers, 1:1 lun mapping) running the lustre.org rpm build kernel for
>>> 1.8.4.  The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000.  Is this
>>> a common problem or should a bug be filed?  Any info available upon
>>> request.  Thanks for your time.
>>> ----------------
>>> John White
>>> High Performance Computing Services (HPCS)
>>> (510) 486-7307
>>> One Cyclotron Rd, MS: 50B-3209C
>>> Lawrence Berkeley National Lab
>>> Berkeley, CA 94720
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
You should examine your kernel I/O scheduler. The deadline scheduler 
sometimes helps in these kinds of circumstances.
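
As a rough sketch of how to check this on each OSS: Linux lists the available schedulers for a block device in /sys/block/<device>/queue/scheduler and marks the active one in square brackets. The helper names below (active_scheduler, report) are made up for illustration, and the script only reads the current setting; it assumes a kernel that exposes that sysfs file.

```python
import glob
import os
import re


def active_scheduler(text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def report(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        # .../<device>/queue/scheduler -> <device>
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            # Unreadable entry: record it rather than abort the whole report.
            result[device] = None
    return result


if __name__ == "__main__":
    for device, scheduler in report().items():
        print(f"{device}: {scheduler or 'unknown'}")
```

Switching is normally done by writing the scheduler name (for example "deadline") to the same file as root, one device at a time, so it is worth comparing the output on the affected file system with the identical one that behaves well before changing anything.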

~Lawrence
