Hi Marc, I wish I had a large enough fabric worth testing collectl on...
I did the math for how much data would be collected for a 10K-node cluster.
It is ~7MB for each iteration:

  10K ports * 6 (3 level fabric * 2 ports on each link) * 116 bytes

where the per-port record is:

  32 bytes (data/pkts tx/rx) + 22 bytes (err counters) + 64 bytes (cong counters) = 116 bytes

Seems reasonable - but it adds up to a large amount of data over a day,
assuming a collection every second:

  24*60*60 * 116 * 10000 * 6 = 6.01344e+11 bytes of storage (~600 GB)

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Mark Seger [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 11, 2007 5:51 PM
> To: Eitan Zahavi
> Cc: Hal Rosenstock; Ira Weiny; [email protected]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] IB performance stats (revisited)
>
> Eitan Zahavi wrote:
>
> >Hi Marc,
> >
> >I published an RFC and later had discussions regarding the distribution
> >of query ownership of switch counters.
> >Making this ownership purely dynamic, semi-dynamic or even static is an
> >implementation tradeoff.
> >However, it can be shown that the maximal number of switches a single
> >compute node would be responsible for is <= number of switch levels. So
> >no problem to get counters every second...
> >
> >The issue is: what do you do with the size of data collected?
> >This is only relevant if monitoring is run in "profiling mode"
> >otherwise only link health errors should be reported.
> >
>
> I use IB data for performance data typically for system/application
> diagnostics. I run a tool I wrote (see
> http://sourceforge.net/projects/collectl/) as a service on most systems
> and it gathers hundreds of performance metrics/counters on everything
> from cpu load, memory, network, infiniband, disk, etc. The philosophy
> here is that if something goes wrong, it may be too late to then run
> some diagnostic.
> Rather you need to have already collected the data, especially if this
> is an intermittent problem. When there is no need to look at the data,
> it just gets purged away after a week.
>
> There have been situations where someone reports a batch program they
> ran the other day was really slow and they didn't change anything.
> Being able to pull up a monitoring log and see what the system was
> doing at the time of the run might reveal their network was saturated
> and therefore their MPI job was impacted. You can't very well turn on
> diagnostics and rerun the application because system conditions have
> probably changed.
>
> Does that help? Why don't you try installing collectl and see what it
> does...
>
> -mark

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
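
The storage estimate at the top of this message can be reproduced with a short Python sketch. It is only an illustration of the arithmetic: the function names and the interval parameter are hypothetical, and the 116-byte per-port record, the factor of 6 and the 10K count are taken as stated in the message rather than measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_sample(nodes: int, ports_per_node: int, record_bytes: int) -> int:
    """Bytes gathered in one collection pass over the whole fabric."""
    return nodes * ports_per_node * record_bytes


def bytes_per_day(sample_bytes: int, interval_seconds: int = 1) -> int:
    """Bytes stored per day when one pass runs every interval_seconds."""
    return sample_bytes * (SECONDS_PER_DAY // interval_seconds)


# Figures as stated in the message: 10K nodes, a factor of 6
# (3 level fabric * 2 ports on each link), 116 bytes per port record.
sample = bytes_per_sample(nodes=10_000, ports_per_node=6, record_bytes=116)
daily = bytes_per_day(sample, interval_seconds=1)

print(f"per iteration: {sample} bytes")  # 6960000, i.e. ~7MB
print(f"per day:       {daily} bytes")   # 601344000000, i.e. 6.01344e+11
```

Under the same assumptions, bytes_per_day(sample, interval_seconds=10) gives one tenth of the daily figure, which shows how strongly the sampling interval drives the storage cost.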
