Eitan Zahavi wrote:

Hi Marc,

I published an RFC and later had discussions regarding the distribution
of query ownership of switch counters.
Making this ownership purely dynamic, semi-dynamic, or even static is an
implementation tradeoff.
However, it can be shown that the maximum number of switches a single
compute node would be responsible for is <= the number of switch levels,
so there is no problem getting counters every second...
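
One way to realize such an assignment, as a minimal sketch in Python: it
assumes a fat-tree described as switch IDs grouped by level, with at least
as many compute nodes as there are switches on any one level; every name
in it is illustrative, not from an actual SM or agent API.

    # Sketch: spread counter-query ownership so that each compute node
    # polls at most one switch per tree level, i.e. at most as many
    # switches as there are levels.
    def assign_ownership(levels, nodes):
        """levels: list of per-level lists of switch IDs (leaf first).
        nodes:  list of compute-node IDs, with len(nodes) >= len(level)
        for every level.  Returns {node: [switches that node polls]}."""
        owners = {n: [] for n in nodes}
        for level in levels:
            # Round-robin this level's switches across the nodes, so
            # each node picks up at most one switch from each level.
            for i, sw in enumerate(level):
                owners[nodes[i % len(nodes)]].append(sw)
        return owners

    # 2-level tree, 4 nodes: no node ends up polling more than 2 switches.
    levels = [["leaf0", "leaf1"], ["spine0"]]
    nodes = ["n0", "n1", "n2", "n3"]
    print(assign_ownership(levels, nodes))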

The issue is: what do you do with the volume of data collected?
This is only relevant if monitoring is run in "profiling mode";
otherwise only link health errors should be reported.
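
To see why the volume only matters in profiling mode, here is a
back-of-the-envelope calculation; every number below (fabric size, switch
radix, counter count, sample width) is an assumption for illustration,
not a figure from this thread:

    # Illustrative only: what 1 Hz polling of every port counter adds up to.
    switches = 324          # assumed fabric size
    ports_per_switch = 36   # assumed switch radix
    counters_per_port = 16  # roughly the size of the IB PortCounters set
    bytes_per_sample = 8    # assume one 64-bit value per counter

    rate = switches * ports_per_switch * counters_per_port * bytes_per_sample
    print(f"{rate} bytes/s, ~{rate * 86400 / 2**30:.0f} GiB/day raw")

That works out to roughly 1.5 MB/s, on the order of 120 GiB/day
uncompressed, so full profiling needs aggregation or purging, while
error-only reporting is tiny by comparison.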
I typically use IB performance data for system/application diagnostics. I run a tool I wrote (see http://sourceforge.net/projects/collectl/) as a service on most systems, and it gathers hundreds of performance metrics/counters on everything from CPU load to memory, network, InfiniBand, and disk. The philosophy here is that if something goes wrong, it may be too late to then run a diagnostic; rather, you need to have already collected the data, especially if the problem is intermittent. When there is no need to look at the data, it simply gets purged after a week.
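
That philosophy is easy to sketch. What follows is not collectl's actual
implementation; the log directory, file layout, and 1 Hz interval are
stand-ins, and the sysfs counter path is simply whatever the kernel's IB
drivers expose on a given system:

    # Minimal "always collect, purge after a week" loop.
    import glob, os, time

    LOG_DIR = "/var/log/perfmon"   # hypothetical log location
    RETENTION = 7 * 86400          # keep one week, as described above

    def sample_ib_counters():
        """Read the per-port IB counters exposed under sysfs."""
        out = {}
        for path in glob.glob("/sys/class/infiniband/*/ports/*/counters/*"):
            with open(path) as f:
                out[path] = f.read().strip()
        return out

    def purge_old_logs():
        """Drop any log file older than the retention window."""
        cutoff = time.time() - RETENTION
        for path in glob.glob(os.path.join(LOG_DIR, "*.log")):
            if os.path.getmtime(path) < cutoff:
                os.remove(path)

    os.makedirs(LOG_DIR, exist_ok=True)
    while True:
        day = time.strftime("%Y%m%d")
        with open(os.path.join(LOG_DIR, day + ".log"), "a") as f:
            f.write("%s %r\n" % (time.time(), sample_ib_counters()))
        purge_old_logs()
        time.sleep(1)              # 1 Hz, matching the rate discussed above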

There have been situations where someone reports that a batch program they ran the other day was really slow even though they didn't change anything. Being able to pull up a monitoring log and see what the system was doing at the time of the run might reveal that their network was saturated and their MPI job was impacted as a result. You can't very well turn on diagnostics and rerun the application, because system conditions have probably changed.

Does that help? Why don't you try installing collectl and see what it does...

-mark

