My basic philosophy, and I suspect there are those who might disagree,
is that you can't use the network to monitor the network, at least not
in times of trouble. That's why I insist on having to query the HCAs
directly since I can't always be sure the network is there and/or
reliable. If you are willing to concede that this can indeed happen,
then the question becomes how you reliably get data from an
HCA, and that's the basis for my (re)starting this discussion.
As for querying the switch for counters, what do you do on a very large
network, say tens of thousands of nodes, if you want to get performance
data every second? I realize this is an extreme situation today
(the node count, not the frequency of monitoring), but I'm sure everyone
would agree systems of this size are not that far off.
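
To put rough numbers on that question, here is a back-of-the-envelope
sketch in Python. It assumes one counter query per port per sweep and a
fixed cap on outstanding queries (500 is the default quoted further down
in this thread). The port counts and function names are made up for
illustration; none of this is measured data.

```python
def queries_per_second(ports: int, interval_s: float) -> float:
    """Queries the monitor must complete per second to sample every port
    once per interval (assumes one query per port per sweep)."""
    return ports / interval_s


def max_round_trip(ports: int, interval_s: float, outstanding: int) -> float:
    """Largest average round-trip time (seconds) per query that still lets
    a sweep finish within the interval, using Little's law:
    throughput = outstanding / latency."""
    return outstanding * interval_s / ports


if __name__ == "__main__":
    for ports in (500, 20_000, 50_000):  # hypothetical port counts
        rate = queries_per_second(ports, 1.0)
        rtt_ms = max_round_trip(ports, 1.0, outstanding=500) * 1000
        print(f"{ports:>6} ports: {rate:>8.0f} queries/s, "
              f"round trip must average <= {rtt_ms:.1f} ms")
```

The point is only that the per-query latency budget shrinks linearly with
the port count, so whatever works at ~500 nodes does not automatically
carry over to a fabric forty times larger.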
-mark
Hal Rosenstock wrote:
Hi Eitan,
On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
Hi Ira,
Second, I have run some tests querying the fabric of our
large clusters here (~500 nodes) and the results were
promising for a single node implementation.
I don't recall the numbers as this was a while ago but it was
on the order of <2 sec and I think <1 but I don't want to be misquoted.
Does PerfMgr query switch ports ?
Yes (of course it does).
If it does I am surprised by the short sweep time you got.
Does it have >1 query on the wire at a given time?
Yes. The default currently appears to be 500 (maybe that needs dialing
back a bit), but it is settable via perfmgr_max_outstanding_queries in
the options file.
If not then I am even more surprised.
Was the cluster running a job at the time of the query ?
Is this question related to VL0 contention ?
-- Hal
Thanks
Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL
-----Original Message-----
From: Ira Weiny [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 10, 2007 7:47 PM
To: Eitan Zahavi
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED];
[email protected]; [EMAIL PROTECTED]
Subject: Re: [ofa-general] IB performance stats (revisited)
On Thu, 28 Jun 2007 10:24:59 +0300
"Eitan Zahavi" <[EMAIL PROTECTED]> wrote:
On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
This is the second time in recent months that I have heard people
complain that the current monitoring solution in OFA is integrated
with OpenSM.
I must have missed this both times (didn't see this in Mark's
post) and the statement itself is somewhat inaccurate as well.
Private talks - I hope they will speak up for themselves now...
These people do not use OpenSM but do use OFED.
I'm not sure I'm following what you mean here.
If you mean that some people want to run PerfMgr without the SM/SA
aspects (so that they can run a vendor based SM), that is the next
thing we are adding to the implementation.
Exactly. OK when is that coming?
There is very little which ties the current PerfMgr to
OpenSM. Basically it just gets the current fabric topology.
As Hal has said changes are coming.
Another drawback is that
no naming is provided and the reporting uses GUIDs.
Naming is provided via NodeDescription.
This might be good for hosts but is not covering switches ...
It does include switches. However, since most systems have
the same name for multiple switches this becomes ineffective.
I have queried Voltaire for a way to change the
NodeDescription for switches, but at the time I asked, there
was no way to do it. Perhaps there is now? What about other
vendors? This is why ibnetdiscover and other diags have
"switch map" support. (A GUID->name mapping to override the
default NodeDescription.) Nothing would please me more than
to be able to remove that for a more "automatic" solution.
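
For anyone who has not used a switch map, the idea can be sketched in a
few lines of Python. The file layout assumed here (a hex GUID followed by
a quoted name, with '#' comments) and the function names are
illustrative assumptions, not a specification of what the diags parse.

```python
def load_switch_map(lines):
    """Parse lines of the assumed form: 0x<guid> "name".
    Blank lines and lines starting with '#' are skipped.
    Returns a dict mapping the GUID (as int) to the name."""
    names = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        guid_text, rest = line.split(None, 1)
        names[int(guid_text, 16)] = rest.strip().strip('"')
    return names


def display_name(guid, node_description, switch_map):
    """Prefer the mapped name; fall back to the NodeDescription."""
    return switch_map.get(guid, node_description)


if __name__ == "__main__":
    example = [
        "# hypothetical entries",
        '0x0008f10400411a08 "rack1-leaf1"',
    ]
    switch_map = load_switch_map(example)
    print(display_name(0x0008F10400411A08, "Some Switch", switch_map))
```

The fallback to NodeDescription is what keeps hosts working unchanged
while switches that share a default description get distinct names.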
I also can't keep myself from saying again that I think you are going
to hit the wall with the concept of doing the PMA from a single node.
If you are referring to the fact that the PerfMgr is currently not
distributed, that will be done as has been stated before.
Good. When is it expected? Will it be OFED 1.3?
When Hal first sent out the PerfMgr design I thought we
should jump right to the distributed model as well. But now
I am glad we have gone the way we did.
First off, we have something which "works" and from which we
can expand.
Second, I have run some tests querying the fabric of our
large clusters here (~500 nodes) and the results were
promising for a single node implementation.
I don't recall the numbers as this was a while ago but it was
on the order of <2 sec and I think <1 but I don't want to be misquoted.
For sure, a distributed model offers many advantages and we
will get there. But for many the current single node
approach should work just fine.
Thanks,
Ira
Thanks
-- Hal
Eitan Zahavi
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hal Rosenstock
Sent: Wednesday, June 27, 2007 8:12 PM
To: Mark Seger
Cc: Finn, Ed; [email protected]
Subject: Re: [ofa-general] IB performance stats (revisited)
On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
The performance managers deal with the counter stickiness (by
resetting them when they think they need to). They typically export
their data although this is not specified by IBA so it is in a vendor
proprietary manner.
so I guess these guys are poor citizens as well...
Not sure what you mean.
the real issue as I see it then means nobody can trust the data if
random tools randomly reset the counters. a real shame...
I consider this to be a real rather than random app for this.
Guess it depends on what one considers random.
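
To make the trust problem concrete, here is a small Python sketch of how
a second reader might compute per-interval deltas when something else can
reset the counter underneath it. It assumes a 32-bit counter that sticks
at its maximum rather than wrapping; the function name and the
(delta, trusted) return convention are invented for illustration.

```python
COUNTER_MAX = 0xFFFFFFFF  # assumption: 32-bit counter that sticks at max


def counter_delta(prev, curr, counter_max=COUNTER_MAX):
    """Return (delta, trusted) for two successive reads of one counter.

    trusted is False when the interval cannot be measured exactly:
    either the counter is pegged at its maximum, or it went backwards,
    which suggests another tool reset it between the two reads.
    """
    if curr >= counter_max:
        return None, False   # saturated: the true count is unknown
    if curr < prev:
        return curr, False   # reset seen: curr is only a lower bound
    return curr - prev, True


if __name__ == "__main__":
    print(counter_delta(100, 250))
    print(counter_delta(900, 40))
    print(counter_delta(10, COUNTER_MAX))
```

Note that this check can miss a reset entirely if the counter climbs back
past the previous reading before the next sample, which is exactly why a
reader that does not own the resets cannot fully trust its deltas.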
-- Hal
-mark
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general