Re: [ofa-general] IB performance stats (revisited)

Mark Seger Wed, 11 Jul 2007 08:13:13 -0700

Hal Rosenstock wrote:

On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
My basic philosophy, and I suspect there are those who might disagree, is that you can't use the network to monitor the network, at least not in times of trouble.
Right, in times of certain troubles.

And that is the key. Since you can't know a priori when you're about to have trouble, you need to be collecting the data locally before it occurs.

That's why I insist on having to query the HCAs directly, since I can't always be sure the network is there and/or reliable. If you are willing to concede that this can indeed happen, then the question becomes one of how you reliably get data from an HCA, and that's the basis for my (re)starting this discussion.
The reliability comes from timeout/retry mechanisms. If performance data
cannot be obtained on an IB network, it needs to be troubleshot at a
lower level (by SMPs).
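
The timeout/retry idea mentioned above can be sketched roughly as follows. This is only an illustration: `query_with_retry`, its parameters, and the `query` callable are hypothetical names, not part of any IB stack, and the doubling of the timeout on each retry is an assumption rather than something stated in this thread.

```python
def query_with_retry(query, retries=3, timeout=1.0, backoff=2.0):
    """Call query(timeout) until it answers or the retries run out.

    `query` is a hypothetical callable standing in for whatever issues the
    performance query; it is assumed to raise TimeoutError when no reply
    arrives within `timeout` seconds.
    """
    last_error = None
    for _ in range(retries):
        try:
            return query(timeout)
        except TimeoutError as err:
            last_error = err
            # Assumed policy: wait longer on each subsequent attempt.
            timeout *= backoff
    # Every attempt timed out; the caller has to fall back to something else.
    raise TimeoutError(f"no response after {retries} attempts") from last_error
```

When every attempt times out, the caller is left with an exception rather than data, which is the case where the problem has to be chased at a lower level.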

In any case, a rearchitecture of the PMA was proposed and seems
reasonable to me in that it can accommodate either approach. All that is
needed now is for someone to step up and champion an implementation of
this. Unfortunately, I do not have time to do so.

I don't know if what I've been proposing requires any rearchitecting, as I see it as something local to each node. Specifically, the idea (and there is already an implementation of this in an earlier Voltaire stack) is to export wrapping HCA counters to /proc. The module that does this reads and clears the counters on every access, but since no local applications are accessing the counters directly, clearing them doesn't hurt anyone. Alas, anyone else who wants to query the counters will find them reset.

The other side benefit of exporting these counters in such a way is that now lots of others can collect/report this info. In other words, if someone chose to add IB stats to sar, it would become very easy to do!

If this is the type of thing people are interested in, I might be able to supply some code to do it.
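
To make the local-collection idea above a bit more concrete, here is a minimal sketch of how a tool might turn two samples of wrapping counters into per-second rates. The "name value" text layout, the counter names, and the 32-bit width are all assumptions for illustration; the actual /proc format of the Voltaire module is not described in this thread. The sketch also assumes a counter wraps at most once between samples.

```python
def parse_counters(text):
    """Parse 'name value' lines (an assumed layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counters[fields[0]] = int(fields[1])
    return counters


def counter_delta(previous, current, width_bits=32):
    """Return how far a wrapping counter advanced between two samples.

    The modulo handles the case where the counter wrapped past its maximum;
    more than one wrap between samples cannot be detected.
    """
    return (current - previous) % (1 << width_bits)


def rates(prev_sample, curr_sample, interval_seconds, width_bits=32):
    """Compute per-second rates for counters present in both samples."""
    return {
        name: counter_delta(prev_sample[name], value, width_bits) / interval_seconds
        for name, value in curr_sample.items()
        if name in prev_sample
    }
```

A collector would read the exported file once per interval, call `parse_counters` on its contents, and pass consecutive samples to `rates`. Nothing here touches the network, which is the point of collecting locally.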

As for querying the switch for counters, what do you do on a very large network, say 10s of thousands of nodes, if you want to get performance data every second? I also realize this is an extreme situation today (the node count, not the frequency of monitoring), but I'm sure everyone would agree systems of these sizes are not that far off.
You have a distributed performance manager to handle this. A hierarchy
of performance managers has been discussed on the list before.

ahh, I see.
-mark


_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
