On 2010-07-08, at 12:03, Wojciech Turek wrote:
> Our Lustre filesystem (Lustre 1.8.3, RHEL5) has recently become very busy and 
> users are noticing the slowness. The Lustre system consists of ~550 clients, 
> and currently 50 different users are running jobs. I can see that the OSS 
> servers have load oscillating between 100 and 300, and collectl shows that 
> there is a lot of I/O going on (mainly reads). I would like to find a good 
> method of finding out which Lustre clients are generating the I/O so I can 
> attribute the high load to particular jobs. I hope that some Lustre users can 
> share their experience in this matter.

There are a number of ways to do this.  One way is to check the 
"/proc/fs/lustre/obdfilter/*/exports/*/stats" files, which contain per-client 
statistics.  They can be cleared by writing "0" to the file; after letting 
them accumulate for a while, look for the files with lots of operations.
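
As a rough sketch of that first approach, the function below adds up the 
operation counts in every per-client stats file and prints the busiest 
clients.  The function name "rank_clients_by_ops" is made up for this 
example.  It assumes that each stats line has the form 
"<name> <count> samples [unit] ..." with a leading "snapshot_time" line, and 
that the directory name under "exports/" identifies the client; check both 
against your own files before trusting the numbers.

```shell
# Rank clients by total operation count across all OSTs on this OSS.
# Assumption: stats lines look like "<name> <count> samples [unit] ...".
# Assumption: the directory name under exports/ identifies the client.
rank_clients_by_ops() {
    base="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$base"/*/exports/*/stats; do
        [ -r "$f" ] || continue
        client=$(basename "$(dirname "$f")")
        # Sum the count column, skipping the snapshot_time header line.
        awk -v c="$client" '
            $1 != "snapshot_time" && $2 ~ /^[0-9]+$/ { n += $2 }
            END { print c, n + 0 }
        ' "$f"
    done |
    # Merge the per-OST totals for each client, then sort busiest first.
    awk '{ total[$1] += $2 } END { for (c in total) print total[c], c }' |
    sort -nr | head -20
}
```

Running "rank_clients_by_ops" with no argument on the OSS reads the default 
path; a different base directory can be passed as the first argument.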

Another way, which I have heard some sites use, is the "rpc history".  They 
may already have a script to do this, but the basics are below:

oss# lctl set_param ost.OSS.ost_io.req_buffer_history=10240
{wait a few seconds to collect some history}
oss# lctl get_param ost.OSS.ost_io.req_history

This will give you a list of the past (up to) 10240 RPCs for the "ost_io" RPC 
service, which is the service where you are observing the high load:

3436037:192.168.2...@tcp:12345-192.168.20....@tcp:x1340648957534353:448:Complete:1278612656:0s(-6s)
 opc 3
3436038:192.168.2...@tcp:12345-192.168.20....@tcp:x1340648957536190:448:Complete:1278615489:1s(-41s)
 opc 3
3436039:192.168.2...@tcp:12345-192.168.20....@tcp:x1340648957536193:448:Complete:1278615490:0s(-6s)
 opc 3

This output is in the format:

identifier:target_nid:source_nid:rpc_xid:rpc_size:rpc_status:arrival_time:service_time(deadline)
 opcode
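
To make the field order easier to check against real output, here is a small 
sketch that prints each colon-separated field of a history record next to its 
name from the format line above.  The function name "label_rpc_fields" is 
hypothetical, and the sketch assumes that none of the fields contains a ":" 
itself.

```shell
# Print each field of the req_history records read on stdin with its name.
# Assumption: fields are separated by ":" and contain no ":" themselves.
label_rpc_fields() {
    awk -F: '
        # A history record starts with a numeric identifier followed by ":".
        /^[0-9]+:/ {
            n = split("identifier target_nid source_nid rpc_xid rpc_size " \
                      "rpc_status arrival_time service_time(deadline)", name, " ")
            for (i = 1; i <= n && i <= NF; i++)
                printf "%s=%s\n", name[i], $i
        }
    '
}
```

It can be fed directly from 
"lctl get_param ost.OSS.ost_io.req_history | label_rpc_fields".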

Using some shell scripting, one can find the clients sending the most RPC 
requests:

oss# lctl get_param ost.OSS.ost_io.req_history | tr ":" " " | \
     cut -d" " -f3,9,10 | sort | uniq -c | sort -nr | head -20


   3443 12345-192.168.20....@tcp opc 3
   1215 12345-192.168.20....@tcp opc 3
    121 12345-192.168.20....@tcp opc 4

This will give you a sorted list of the top 20 clients that are sending the 
most RPCs to the ost_io service, along with the operation being done (3 = 
OST_READ, 4 = OST_WRITE).
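
The one-liner relies on the opcode being on the same line as the rest of the 
record.  A slightly more tolerant variant, offered only as a sketch, is the 
awk filter below: it counts RPCs per source NID and opcode whether the 
"opc N" part trails the record or sits on a continuation line of its own.  
The function name "top_rpc_clients" is made up for this example, and the 
record layout is assumed to match the format described above.

```shell
# Count RPCs per (source NID, opcode) from req_history output on stdin.
# Assumption: records are "id:target_nid:source_nid:..." and the opcode
# appears as "opc N" either on the same line or on the following line.
top_rpc_clients() {
    awk '
        # A history record starts with a numeric identifier followed by ":".
        /^[0-9]+:/ {
            split($0, f, ":")
            nid = f[3]
            # Opcode on the same line as the record.
            if (match($0, /opc [0-9]+/)) {
                count[nid " " substr($0, RSTART, RLENGTH)]++
                nid = ""
            }
            next
        }
        # Opcode on a continuation line belonging to the previous record.
        /^[ \t]*opc [0-9]+/ && nid != "" {
            count[nid " opc " $2]++
            nid = ""
        }
        END { for (k in count) print count[k], k }
    ' | sort -nr | head -20
}
```

Usage is the same as before: 
"lctl get_param ost.OSS.ost_io.req_history | top_rpc_clients".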

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
