Hi everyone,

We recently saw some extremely high stat loads on our Lustre file system.
Output from “llstat -i 1 mdt” looked like:

[root@hpfs-fsl-mds0 lustre]#
/proc/fs/lustre/mds/MDS/mdt/stats @ 1530642446.366015124
Name          Cur.Count  Cur.Rate  #Events      Unit    last      min  avg    max      stddev
req_waittime  182102     182102    22343734858  [usec]  3951261   2    38.94  3027235  897.47
req_qdepth    182103     182103    22343734859  [reqs]  12528     0    0.08   571      0.29
req_active    182103     182103    22343734859  [reqs]  484211    1    2.88   99       1.50
req_timeout   182104     182104    22343734860  [sec]   182104    1    9.32   36       13.44
reqbuf_avail  437980     437980    55509906321  [bufs]  27996863  32   63.89  65       0.47

This was driving the load on our MDS up into the 100 to 200 range.
Surprisingly, the MDS and the LFS from a client were still generally
responsive.  The numbers in the “Cur.Count” column are normally in the hundreds
to thousands for our file system (we have ~600 Lustre clients).  The kiblnd_sd_*
and ldlm_* processes were driving up the load.

We’ve tracked down the users causing this.  We identified two different
workloads that were causing the problems.  One of them is fairly common, the
other fairly infrequent.  There are a couple of things I wanted input on from
the wider community.

First, since one of the workloads is common for our lab and we haven’t seen
this issue before (at least not to this extent), we think this might be related
specifically to 2.10.4, which we recently updated to.  We didn’t see anything in
the changelog that was obviously related, but if there are any known issues or
other groups seeing this, that would be good to know.  We are using ZFS on
both the MDT and the OSTs.

Also, the ldlm processes led us to look at flock vs. localflock.  On
previous generations of our LFSs, we used localflock, but on the current LFS
we decided to try flock instead.  This LFS has been in production for a couple
of years with no obvious problems due to flock, but we have dropped back to
localflock as a precaution for now.  We need to do a more controlled test, but
this does seem to help.  What are other sites using for locking parameters?

Thanks,
Darby
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org