Re: [Lustre-discuss] help tracking down extremely high loads on OSSs

Jason Hill Tue, 19 Oct 2010 08:22:22 -0700

Also something to look at if you aren't having any luck with other avenues
would be the debug log with RPC trace enabled. We do something like:


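# Turn on RPC tracing in the Lustre debug mask.
echo +rpctrace > /proc/sys/lnet/debug
# Dump (and thereby clear) the existing debug buffer, then collect for 60 seconds.
lctl dk > /dev/null; sleep 60
# Save the trace and turn RPC tracing back off.
lctl dk > /tmp/rpctrace; echo -rpctrace > /proc/sys/lnet/debug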

You'll need to know what all the opcodes are (that's available in the code, I
believe), but that will give you a definite breakdown of every action that's
happening.
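
To get a quick per-opcode tally out of the dump, something along these lines
may help. It is only a rough sketch: the function name count_opcodes is made
up for illustration, and it assumes that each request shows up on a line
containing "Handling RPC" with the opcode as the last colon-separated field.
Check a few lines of your own /tmp/rpctrace first, since the exact format may
differ between Lustre versions.

```shell
#!/bin/sh
# Hypothetical helper: count RPCs per opcode in an rpctrace dump.
# ASSUMPTION: request lines contain "Handling RPC" and end in ":<opcode>".
count_opcodes() {
    awk '/Handling RPC/ {
             n = split($0, f, ":")        # opcode assumed to be the last field
             gsub(/[^0-9]/, "", f[n])     # drop stray whitespace, if any
             count[f[n]]++
         }
         END { for (op in count) print count[op], op }' "$1" | sort -rn
}
```

Running `count_opcodes /tmp/rpctrace` prints one "count opcode" pair per line,
busiest opcode first, which you can then match against the opcode definitions
in the source.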

You may also want to look at +neterror, etc. More info available from the
manual or lustre.org I'm sure. 

-Jason

On Tue, Oct 19, 2010 at 11:01:59AM -0400, Lawrence Sorrillo wrote:
>   On 10/18/2010 2:58 PM, John White wrote:
> > On Oct 18, 2010, at 10:49 AM, Peter Kjellstrom wrote:
> >
> >> On Monday 18 October 2010, John White wrote:
> >>> Hello Folks,
> >>>   A while back (say 3 weeks ago) we started noticing extremely high loads
> >>> (load avg around 300 at times) on our OSSs when in production and serving
> >>> IO.  This cluster was, at the time, on 1.8.2 (we have since upgraded to
> >>> 1.8.4 but the problem remains).  The load increases fairly predictably as
> >>> clients generate IO but even 2 clients can produce a load avg above 5.00.
> >> Does this impact performance or does it only show up as an unexpectedly
> >> high number on the OSSes?
> > We have gotten reports of scaling issues that we had not experienced prior 
> > to this issue cropping up.  Throughput is certainly less predictable than 
> > before but we are able to hit the same peaks.
> >
> >> /Peter
> >>
> >>> An identical file system of ours does not exhibit this behavior (sticks
> >>> below load avg 1.00 under even the heaviest IO load).  I've looked around
> >>> bugzilla and haven't found anything.  We've disabled heartbeat on the
> >>> off-chance that was generating the load (it's not), we've attempted using
> >>> a different client transport (o2ib->tcp), this did not solve the issue.
> >>> There doesn't appear to be any specific non-kernel thread causing the
> >>> high-load.  The only info in dmesg/syslog pertains to sporadic client
> >>> evictions or sporadic slow setattr due to heavy IO load (we've since tuned
> >>> the number of OST threads).  We're basically out of ideas to try.
> >>>
> >>> As reference, this is a 1 MDS/4 OSS cluster backed by a DDN 9900 couplet
> >>> (15 tiers, 1:1 lun mapping) running the lustre.org rpm build kernel for
> >>> 1.8.4.  The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000.  Is this
> >>> a common problem or should a bug be filed?  Any info available upon
> >>> request.  Thanks for your time.
> >>> ----------------
> >>> John White
> >>> High Performance Computing Services (HPCS)
> >>> (510) 486-7307
> >>> One Cyclotron Rd, MS: 50B-3209C
> >>> Lawrence Berkeley National Lab
> >>> Berkeley, CA 94720
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> [email protected]
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> You should examine your kernel I/O scheduler. The deadline scheduler
> sometimes helps in these kinds of circumstances.
> 
> ~Lawrence

-- 
-Jason
-------------------------------------------------
//  Jason J. Hill                              //
//  HPC Systems Administrator                  //
//  National Center for Computational Sciences //
//  Oak Ridge National Laboratory              // 
//  e-mail: [email protected]                    //
//  Phone: (865) 576-5867                      //
-------------------------------------------------
