Bill,

One option I have used in the past is to look at the RPC request history.  For 
example, on an OSS, you can run:

lctl get_param ost.OSS.ost_io.req_history

and then extract the client NID for each request.  Based on that, you can 
count the requests coming into the server from each client and look for any 
clients whose counts are significantly higher than the others.  Maybe 
something like:

lctl get_param ost.OSS.ost_io.req_history | cut -d: -f3 | sort | uniq -c | sort -n
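
The same count can be wrapped in a small helper.  This is only a sketch: the 
name count_by_client is made up for illustration, and it assumes, like the cut 
command above, that the client is the third colon-separated field of each 
req_history line.  Lines with fewer than three fields (such as the parameter 
name that get_param prints first) are skipped.

```shell
# Count requests per client from req_history text on stdin.
# Assumption: the client is the third colon-separated field of each line.
count_by_client() {
    awk -F: 'NF >= 3 { count[$3]++ }
             END { for (c in count) print count[c], c }' | sort -n
}

# Usage on an OSS (-n prints only the value, without the parameter name):
# lctl get_param -n ost.OSS.ost_io.req_history | count_by_client
```

The busiest clients end up at the bottom of the output.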

I have used that approach in the past to identify misbehaving clients (the 
number of requests from such clients was usually one or two orders of magnitude 
higher than the others).  If several clients have unusually high counts, you 
may be able to correlate those nodes with currently running jobs to identify a 
particular job (assuming you don't already have Lustre jobstats enabled).

-Rick


On 5/4/21, 2:41 PM, "lustre-discuss on behalf of Bill Anderson via 
lustre-discuss" <[email protected] on behalf of 
[email protected]> wrote:


       Hi All,

       Can you recommend good ways to identify Lustre client hosts that might 
be causing stability or performance problems for the entire filesystem?

       For example, if a user is inadvertently doing something that's creating 
an RPC storm, what are good ways to identify the client host that has triggered 
the storm?

       Thank you!

       Bill

