Hi everyone,
We recently saw some extremely high stat loads on our Lustre file system.
Output from “llstat -i 1 mdt” looked like:
[root@hpfs-fsl-mds0 lustre]#
/proc/fs/lustre/mds/MDS/mdt/stats @ 1530642446.366015124
Name          Cur.Count  Cur.Rate  #Events      Unit    last      min  avg    max      stddev
req_waittime  182102     182102    22343734858  [usec]  3951261   2    38.94  3027235  897.47
req_qdepth    182103     182103    22343734859  [reqs]  12528     0    0.08   571      0.29
req_active    182103     182103    22343734859  [reqs]  484211    1    2.88   99       1.50
req_timeout   182104     182104    22343734860  [sec]   182104    1    9.32   36       13.44
reqbuf_avail  437980     437980    55509906321  [bufs]  27996863  32   63.89  65       0.47
This was driving the load average on our MDS up into the 100 to 200 range.
Surprisingly, the MDS, and the LFS as seen from a client, were still generally
responsive. The numbers in the “Cur.Count” column are normally in the hundreds
to thousands for our file system (we have ~600 Lustre clients). The kiblnd_sd_*
and ldlm_* processes were driving up the load.
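For anyone who wants to watch for the same condition, here is a minimal Python
sketch that parses one llstat row and flags it when Cur.Count is well above our
usual range. It assumes each counter's fields sit on a single line in the order
given by the header above. The names parse_row and is_overloaded and the 10000
threshold are made up for illustration; they are not part of llstat.

```python
import re
from typing import NamedTuple, Optional

# Assumed row layout, inferred from the llstat header in this post:
# Name Cur.Count Cur.Rate #Events [Unit] last min avg max stddev
# The unit may be fused to #Events ("22343734858[usec]") or separated.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<count>\d+)\s+(?P<rate>\d+)\s+"
    r"(?P<events>\d+)\s*\[(?P<unit>\w+)\]\s+(?P<rest>.+)$"
)


class StatRow(NamedTuple):
    name: str
    cur_count: int
    cur_rate: int
    events: int
    unit: str
    last: float
    min_val: float
    avg: float
    max_val: float
    stddev: float


def parse_row(line: str) -> Optional[StatRow]:
    """Parse one llstat counter row; return None if it does not match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    rest = m.group("rest").split()
    if len(rest) != 5:
        return None
    last, lo, avg, hi, sd = (float(x) for x in rest)
    return StatRow(
        m.group("name"), int(m.group("count")), int(m.group("rate")),
        int(m.group("events")), m.group("unit"), last, lo, avg, hi, sd,
    )


def is_overloaded(row: StatRow, normal_max: int = 10_000) -> bool:
    """Hypothetical check: Cur.Count above the site's usual range."""
    return row.cur_count > normal_max


row = parse_row(
    "req_waittime 182102 182102 22343734858[usec] 3951261 2 38.94 3027235 897.47"
)
if row is not None and is_overloaded(row):
    print(f"{row.name}: Cur.Count={row.cur_count} exceeds the normal range")
```

The threshold is site-specific; ours would sit somewhere above the low
thousands, but another file system may need a very different value.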
We’ve tracked down the users causing this. We identified two different
workloads that were causing the problems. One of them is fairly common for us;
the other is fairly infrequent. There are a couple of things I wanted input on
from the wider community.
First, since one of the workloads is common for our lab and we haven’t seen
this issue before (at least not to this extent), we think this might be related
specifically to 2.10.4, which we recently updated to. We didn’t see anything in
the changelog that was obviously related, but if there are any other known
issues or other groups seeing this, that would be good to know. We are using
ZFS on both the MDT and the OSTs.
Also, the ldlm processes led us to look at flock vs. localflock. On previous
generations of our LFSs, we used localflock, but on the current LFS we decided
to try flock instead. This LFS has been in production for a couple of years
with no obvious problems due to flock, but we decided to drop back to
localflock as a precaution for now. We need to do a more controlled test, but
this does seem to help. What are other sites using for locking parameters?
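If it helps when comparing notes, here is a minimal Python sketch for checking
which lock option a client is actually mounted with. It assumes the client
lists flock or localflock among the mount options in /proc/mounts; the function
name lustre_lock_modes and the sample mount line (address and mount point) are
made up for illustration.

```python
from typing import Dict


def lustre_lock_modes(mounts_text: str) -> Dict[str, str]:
    """Map each Lustre mount point to 'flock', 'localflock', or 'none'.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    modes: Dict[str, str] = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "lustre":
            continue
        opts = fields[3].split(",")
        if "localflock" in opts:
            mode = "localflock"
        elif "flock" in opts:
            mode = "flock"
        else:
            mode = "none"  # neither option listed for this mount
        modes[fields[1]] = mode
    return modes


# Hypothetical sample line; on a real client, read /proc/mounts instead.
sample = "10.0.0.1@o2ib:/hpfs /mnt/hpfs lustre rw,flock,lazystatfs 0 0\n"
for mount_point, mode in lustre_lock_modes(sample).items():
    print(mount_point, mode)
```

On a live client the input would be the contents of /proc/mounts, for example
lustre_lock_modes(open("/proc/mounts").read()).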
Thanks,
Darby