We've thoroughly examined the back-end storage and the connections between the 
OSSs and the back end, and there are currently no faults.  Our couplet had 
previously lost cache sync, but that has since been resolved, and the load 
issue remains.
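
On Linux the load average counts not only runnable tasks but also tasks in 
uninterruptible sleep (state D), which is where kernel threads usually sit 
while waiting on block I/O.  One way to check whether the load comes from OST 
service threads waiting on storage is to sample per-thread states in /proc.  
The following is a minimal sketch, assuming a standard Linux /proc layout; the 
function names are hypothetical, and stripping trailing digits to group 
numbered thread pools is an illustrative choice, not anything Lustre-specific.

```python
import os
import time
from collections import Counter


def task_state(stat_text):
    """Return (comm, state) parsed from a /proc/<pid>/task/<tid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the *last* ')' rather than on whitespace.
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 2]  # field 3, right after ") "
    return comm, state


def d_state_counts(proc_root="/proc"):
    """Count tasks in uninterruptible sleep ('D'), grouped by pool name.

    Trailing digits are stripped so numbered service threads
    (e.g. foo_00 ... foo_99) are reported as one group.
    """
    counts = Counter()
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    comm, state = task_state(f.read())
            except (OSError, ValueError, IndexError):
                continue  # task exited or its stat line was unreadable
            if state == "D":
                counts[comm.rstrip("0123456789")] += 1
    return counts


if __name__ == "__main__":
    # Take a few samples; one thread pool persistently stuck in 'D'
    # would point at I/O waits rather than CPU work.
    for _ in range(5):
        print(d_state_counts().most_common(10))
        time.sleep(1)
```

If one pool of kernel threads dominates the D-state count while the load is 
high, that would fit the original observation that no specific non-kernel 
thread is responsible.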


On Oct 18, 2010, at 10:43 AM, Paul Nowoczynski wrote:

> I wonder if there's some type of fault in the I/O path which is increasing 
> the latency of individual I/Os?  Something like this could affect the load 
> especially when considering the number of kernel threads on the OST.
> paul
> 
> John White wrote:
>> Hello Folks,
>>      A while back (say 3 weeks ago) we started noticing extremely high loads 
>> (load avg around 300 at times) on our OSSs when in production and serving 
>> IO.  This cluster was, at the time, on 1.8.2 (we have since upgraded to 
>> 1.8.4 but the problem remains).  The load increases fairly predictably as 
>> clients generate IO but even 2 clients can produce a load avg above 5.00.  
>> An identical file system of ours does not exhibit this behavior (sticks 
>> below load avg 1.00 under even the heaviest IO load).  I've looked around 
>> bugzilla and haven't found anything.  We've disabled heartbeat on the 
>> off-chance that it was generating the load (it's not), and we've tried a 
>> different client transport (o2ib->tcp), but this did not solve the issue.  
>> There doesn't appear to be any specific non-kernel thread causing the high load.  
>> The only info in dmesg/syslog pertains to sporadic client evictions or 
>> sporadic slow setattr due to heavy IO load (we've since tuned the number of 
>> OST threads).  We're basically out of ideas to try.
>> 
>> As reference, this is a 1 MDS/4 OSS cluster backed by a DDN 9900 couplet (15 
>> tiers, 1:1 lun mapping) running the lustre.org rpm build kernel for 1.8.4.  
>> The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000.  Is this a common 
>> problem or should a bug be filed?  Any info available upon request.  Thanks 
>> for your time.
>> ----------------
>> John White
>> High Performance Computing Services (HPCS)
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50B-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>> 
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>  
> 
> 
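Paul's suggestion can be made concrete with Little's law: the mean number of 
requests in service equals the arrival rate times the mean latency, capped by 
the number of service threads available to hold them.  If those threads wait 
in uninterruptible sleep, each one adds to the load average.  This is a rough 
sketch, assuming a steady request stream and a fixed thread pool; the rate, 
latencies, thread count and function name below are hypothetical illustration 
values, not measurements from this cluster.

```python
def expected_busy_threads(arrival_rate, latency_s, n_threads):
    """Little's law estimate of service threads busy at any instant.

    Mean requests in service = arrival rate (req/s) x mean latency (s),
    but no more than the thread-pool size; extra requests wait in the
    queue instead of occupying a thread.
    """
    return min(float(n_threads), arrival_rate * latency_s)


if __name__ == "__main__":
    RATE = 500      # hypothetical: 500 I/O requests per second
    THREADS = 128   # hypothetical OST thread-pool size
    for latency_ms in (2, 10, 100, 500):
        busy = expected_busy_threads(RATE, latency_ms / 1000, THREADS)
        print(f"latency {latency_ms:4d} ms -> ~{busy:6.1f} threads waiting on I/O")
```

Under this model the thread count caps how far the load can climb, while the 
per-I/O latency sets how quickly it gets there at a given request rate.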

----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720