On Wed, 2007-07-11 at 11:00, Mark Seger wrote:
> Hal Rosenstock wrote:
>
> >On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> >
> >>My basic philosophy, and I suspect there are those who might disagree,
> >>is that you can't use the network to monitor the network, at least not
> >>in times of trouble.
> >
> >Right, in times of certain troubles.
>
> and that is the key.  since you can't know a priori when you're about to
> have troubles, you need to be collecting the data locally before they occur.
>
> >>That's why I insist on having to query the HCAs
> >>directly since I can't always be sure the network is there and/or
> >>reliable.  If you are willing to concede that this can indeed happen
> >>then the question becomes one of how do you reliably get data from an
> >>HCA and that's the basis for my (re)starting this discussion.
> >
> >The reliability comes from timeout/retry mechanisms. If performance data
> >cannot be obtained on an IB network, it needs to be troubleshot at a
> >lower level (by SMPs).
> >
> >In any case, a rearchitecture of the PMA was proposed and seems
> >reasonable to me in that it can accommodate either approach. All that is
> >needed now is for someone to step up and champion an implementation of
> >this. Unfortunately, I do not have time to do so.
>
> I don't know if what I've been proposing requires any rearchitecting as
> I see it as something local to each node.  Specifically, and there is
> already an implementation of this in an earlier voltaire stack, is to
> export wrapping HCA counters to /proc.  The module that does this
> read/clears the counters on every access but since no local applications
> are accessing the counters directly, clearing them doesn't hurt anyone.
> Alas, anyone else who wants to query the counters will find them reset.

No local application but perhaps a remote one. This is the reason for
the proposed rearchitecture (along with synthesizing the wider
counters).

-- Hal

> The other side benefit of exporting these counters in such a way is now
> lots of others can collect/report this info.  In other words if someone
> chose to add IB stats to sar, it would become very easy to do!
>
> If this is the type of thing people are interested in, I might be able
> to supply some code to do it.
>
> >>As for querying the switch for counters, what do you do on a very large
> >>network, say 10s of thousands of nodes if you want to get performance
> >>data every second?  I also realize this is an extreme situation today
> >>(the node count not the frequency of monitoring) but I'm sure everyone
> >>would agree systems of these sizes are not that far off.
> >
> >You have a distributed performance manager to handle this. A hierarchy
> >of performance managers has been discussed on the list before.
>
> ahh, I see.
> -mark

_______________________________________________
general mailing list
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
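
For illustration only, here is a rough sketch of what "synthesizing the wider counters" from narrow ones could look like on the collecting side. It is not code from the voltaire stack or from any PMA proposal; the class name WideCounter, the 32-bit width, and the read_and_clear flag are all assumptions made for the example. It covers both cases discussed in the thread: a counter that is reset on every read (each sample is already a delta) and a counter that is left running and wraps.

```python
COUNTER_BITS = 32                # assumed width of the hardware counter
COUNTER_MOD = 1 << COUNTER_BITS
WIDE_MASK = (1 << 64) - 1        # keep the synthesized total within 64 bits


class WideCounter:
    """Accumulate a narrow hardware counter into a 64-bit running total.

    Hypothetical helper: not part of any existing IB stack.
    """

    def __init__(self, read_and_clear=False):
        self.read_and_clear = read_and_clear
        self.total = 0
        self._last = None  # previous raw sample (used in wrapping mode only)

    def update(self, raw):
        """Fold one raw sample into the total and return the total."""
        if self.read_and_clear:
            # The previous read reset the counter, so raw is already a delta.
            delta = raw
        elif self._last is None:
            # The first sample only establishes a baseline.
            delta = 0
        else:
            # Modular subtraction absorbs at most one wrap between samples.
            delta = (raw - self._last) % COUNTER_MOD
        self._last = raw
        self.total = (self.total + delta) & WIDE_MASK
        return self.total


if __name__ == "__main__":
    c = WideCounter()
    c.update(4294967000)         # baseline, close to the 32-bit limit
    print(c.update(704))         # counter wrapped between the two samples
```

The wrapping mode only stays correct if samples are taken often enough that the counter cannot wrap more than once between reads, which is one more argument for collecting locally and frequently.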
