This command simply uses ssh to reach all the nodes, dumps the waiter information on each, and collects it. That means if a node is down or slow to respond, or there are a large number of nodes, it could take a while to return. In my 400-500 node clusters this command usually takes less than 10 seconds. I do prefix the command with a timeout value in case a node is hung and ssh never returns (which sometimes happens, and that's not the fault of GPFS). Something like this:
timeout 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L

This means I can get incomplete information, but if you don't do this, you end up piling up a lot of hung commands. I would check over your cluster carefully to see if there are other issues that might cause ssh to hang, which could impact other GPFS commands that distribute via ssh.

Another approach would be to dump the waiters locally on each node, send the node-specific information to the database, and then sum it up using the graphing software.

Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid

From: <[email protected]> on behalf of Roland Pabel <[email protected]>
Organization: RRZK Uni Köln
Reply-To: gpfsug main discussion list <[email protected]>
Date: Friday, April 15, 2016 at 10:50 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Executing Callbacks on other Nodes

Hi,

In our cluster, mmlsnode -N waiters -L takes about 25 seconds to run. So running it every 30 seconds is a bit close. I'll try running it once a minute and then incorporating this into our graphing.

Maybe the command is so slow for me because a few nodes are down? Is there a parameter to mmlsnode to configure the timeout?
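
For the timeout-prefix approach described in the reply above, here is a minimal sketch of a wrapper that also reports when the limit was hit, so a truncated result is not mistaken for a complete one. It assumes GNU coreutils `timeout`, which exits with status 124 when it kills the command; the function name `run_with_timeout` is hypothetical and not part of GPFS.

```shell
#!/bin/sh
# Hypothetical helper (not part of GPFS): run a command under GNU coreutils
# `timeout` and warn on stderr if the time limit was reached.
run_with_timeout() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is the exit status GNU timeout returns when the limit expires.
        echo "timed out after $limit; output may be incomplete" >&2
    fi
    return "$status"
}

# Example use on a cluster node, with the command from the message above:
# run_with_timeout 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L
```

A collection script could check for status 124 and skip or flag that sample instead of graphing a partial waiter count.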
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
