Roland

  Here's a tool written by NCAR that provides waiter information on a per-node 
basis, using a lightweight daemon on the monitored node.   I have been using it 
for a while, and it has helped me find and diagnose nodes with long waiters.

  It might do what you are looking for.

  https://sourceforge.net/projects/gpfsmonitorsuite/

jeff

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Roland Pabel
Sent: Monday, April 18, 2016 9:10 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Executing Callbacks on other Nodes

Hi Bob,

I'll try the second approach, i.e., collecting "mmfsadm dump waiters" locally 
on each node and then summing the values up, since it can be done without the 
overhead of ssh.
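
A minimal sketch of the local collection step, to make the idea concrete. It 
reads waiter output on standard input and prints the number of waiters and the 
longest wait in seconds. The function name summarize_waiters is hypothetical, 
and the line format is an assumption: the sketch expects each waiter line to 
contain "waiting <number> sec" (in any case), which may not match every GPFS 
release, so check it against real output first.

```shell
#!/bin/sh
# Summarize waiters read from standard input (hypothetical helper).
# ASSUMPTION: each waiter line contains "waiting <number> sec..." in any
# case; the real "mmfsadm dump waiters" format may differ between releases.
# Prints: "<waiter count> <longest wait in seconds>"
summarize_waiters() {
    LC_ALL=C awk '
        {
            line = tolower($0)
            if (match(line, /waiting [0-9.]+ sec/)) {
                # The number sits between "waiting " (8 chars) and " sec" (4 chars).
                secs = substr(line, RSTART + 8, RLENGTH - 12) + 0
                count++
                if (secs > longest) longest = secs
            }
        }
        END { printf "%d %.3f\n", count, longest }
    '
}
```

On a node this could be fed with something like 
"/usr/lpp/mmfs/bin/mmfsadm dump waiters | summarize_waiters", with the two 
numbers then sent to the graphing database per node.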

You mentioned mmlsnode starts all these ssh commands and that made me look into 
the file itself. I then noticed most of the mm commands are actually scripts. 
This helps a lot with regard to my original question. mmdsh seems to do what I 
need.

Thanks,

Roland


> This command is just using ssh to all the nodes and dumping the waiter 
> information and collecting it. That means if the node is down, slow to 
> respond, or there are a large number of nodes, it could take a while 
> to return.  In my 400-500 node clusters this command usually takes less 
> than 10 seconds. I do prefix the command with a timeout value in case 
> a node is hung up and ssh never returns (which it sometimes does, and 
> that’s not the fault of GPFS). Something like this:
 
> timeout 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L
> 
> This means I may get incomplete information, but without the timeout 
> you end up piling up a lot of hung commands. I would check over your 
> cluster carefully to see if there are other issues that might cause ssh 
> to hang, which could impact other GPFS commands that distribute via ssh.
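
A small sketch of how the timeout prefix could be wrapped so that a cut-off 
run is visible to whatever consumes the output. It relies on the coreutils 
timeout command exiting with status 124 when the time limit is reached. The 
function name run_limited is hypothetical and not part of GPFS.

```shell
#!/bin/sh
# Run a command under a time limit and warn when it was cut short
# (hypothetical helper). Usage: run_limited <duration> <command> [args...]
run_limited() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    # coreutils timeout exits with 124 when the limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "warning: '$1' exceeded $limit; output may be incomplete" >&2
    fi
    return "$status"
}
```

For example, "run_limited 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L" 
behaves like the command above, but a collector can check for exit status 
124 and mark that sample as partial instead of treating it as complete.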
 
> Another approach would be to dump the waiters locally on each node, 
> send node specific information to the database, and then sum it up 
> using the graphing software.
 
> Bob Oesterlin
> Sr Storage Engineer, Nuance HPC Grid
> 
> From: <[email protected]> on behalf of Roland Pabel <[email protected]>
> Organization: RRZK Uni Köln
> Reply-To: gpfsug main discussion list <[email protected]>
> Date: Friday, April 15, 2016 at 10:50 AM
> To: gpfsug main discussion list <[email protected]>
> Subject: Re: [gpfsug-discuss] Executing Callbacks on other Nodes
> 
> Hi,
> 
> In our cluster, mmlsnode -N waiters -L takes about 25 seconds to run. 
> So running it every 30 seconds is a bit close. I'll try running it 
> once a minute and then incorporating this into our graphing.
> 
> Maybe the command is so slow for me because a few nodes are down?
> Is there a parameter to mmlsnode to configure the timeout?
> 
> 

--
Dr. Roland Pabel
Regionales Rechenzentrum der Universität zu Köln (RRZK) Weyertal 121, Raum 3.07
D-50931 Köln

Tel.: +49 (221) 470-89589
E-Mail: [email protected]
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss