This command simply uses ssh to reach every node, dump the waiter 
information, and collect it. That means if a node is down or slow to respond, 
or there are a large number of nodes, it could take a while to return. In my 
400-500 node clusters this command usually takes less than 10 seconds. I do 
prefix the command with a timeout value in case a node is hung and ssh never 
returns (which sometimes happens, and that’s not the fault of GPFS). Something 
like this:

timeout 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L

This means I may get incomplete information, but without the timeout you end 
up piling up a lot of hung commands. I would check over your cluster carefully 
to see if there are other issues that might cause ssh to hang, since those 
could impact other GPFS commands that distribute via ssh.
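
If you wrap the call in a script, it can help to record when the timeout 
actually fired, so that incomplete samples can be told apart from complete 
ones. A minimal sketch, assuming GNU coreutils timeout (which exits with 
status 124 when the limit is reached); collect_waiters is a made-up helper 
name, not part of GPFS:

```shell
#!/bin/sh
# Sketch only: run a command under a time limit and warn if the limit was hit.
# collect_waiters is a hypothetical helper name, not a GPFS command.
# Assumes GNU coreutils `timeout`, which exits with status 124 on timeout.
collect_waiters() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # The command was killed by timeout, so any output seen is partial.
        echo "warning: '$*' timed out after $limit; output is incomplete" >&2
    fi
    return "$rc"
}
```

It would be called as: collect_waiters 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L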

Another approach would be to dump the waiters locally on each node, send the 
node-specific information to the database, and then sum it up using the 
graphing software.
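
A rough sketch of the per-node side of that idea, using mmdiag --waiters to 
dump the local waiters. The helper names are made up, and the counting rule is 
an assumption on my part: it treats every output line containing the word 
"waiting" as one waiter, so check that against the output on your release 
before trusting the numbers. How the line gets to the database is left out, 
since that depends on your setup.

```shell
#!/bin/sh
# Sketch only: count local waiters on this node and print "<node> <count>".
# count_waiters and report_local_waiters are hypothetical helper names.

# ASSUMPTION: each waiter is one line containing "waiting" (any case).
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_waiters() {
    grep -ci 'waiting' || true
}

# Runs on each node (for example from cron). If mmdiag is missing or fails,
# this reports 0, so a real script would want to check for that separately.
report_local_waiters() {
    node=$(hostname)
    count=$(/usr/lpp/mmfs/bin/mmdiag --waiters 2>/dev/null | count_waiters)
    printf '%s %s\n' "$node" "$count"
}
```

Each node then reports only its own count, and a down or hung node simply 
stops reporting instead of holding up the collection for everyone else.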

Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid

From: <[email protected]> on behalf of Roland Pabel <[email protected]>
Organization: RRZK Uni Köln
Reply-To: gpfsug main discussion list <[email protected]>
Date: Friday, April 15, 2016 at 10:50 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Executing Callbacks on other Nodes

Hi,

In our cluster, mmlsnode -N waiters -L takes about 25 seconds to run. So
running it every 30 seconds is a bit close. I'll try running it once a minute
and then incorporating this into our graphing.

Maybe the command is so slow for me because a few nodes are down?
Is there a parameter to mmlsnode to configure the timeout?


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
