Kip Murray and Brian Schott have already commented usefully. Here are some further comments which might help further, or might just confuse!
You yourself, Roger, have often pointed out to those of us who have been agonising over fine-tuning some verb or set of verbs that there is little point in fussing over small improvements in performance. Years ago, my assistant was very chuffed with himself for gently pointing out to some food scientists that there was little use in showing, with a very high level of statistical significance, that one process for baking bread yielded a minutely higher level of some nutrient than an alternative process did.

You will presumably be interested in observing or achieving some useful level of improvement or difference, and I suspect that that level matters more to you than a fairly academic, formal judgment of whether a statistically significant difference has been observed. If you know, a priori, that you wish to demonstrate a certain measure of difference (for example, that the percentage difference in run time should be greater than 50%) with a certain degree of confidence (p < 0.05, say), you might look into discussions of "statistical power". FWIW, medical and nutrition journals expect results to be presented with confidence intervals of a given probability, typically 95% or 99%.

Something else worth bearing in mind is that it can be useful to examine the actual distributions of what you're measuring. If the empirical distribution is markedly non-normal, then both Student's t and the "large-sample" z statistic are somewhat compromised - still useful, but you're less well assured that the tabled probabilities are appropriate.

I'd have thought that the applied side of computer science would have well-developed approaches to comparing performance measurements by now, or are they kept quiet for commercial reasons?

Mike

On 20/10/2011 9:04 PM, Roger Hui wrote:
> This message is addressed to Forum members who are knowledgeable in
> statistics.
>
> The objective is to test whether the same expression is faster,
> slower, or takes the same amount of time, on the two different
> versions of the interpreter. We know that due to vagaries of the
> operating system, the way interpreters are built (in particular the
> memory usage), the phase of the moon, ... the same expression will run
> in different times. Are the times "the same"?
>
> From stat courses taken long ago and from consulting ancient stats
> texts, I get the idea that the following may be applicable:
>
> a. "Large-Sample Test" on the mean running time, with
>    Z=(theta - theta0)%s_theta0 as the normally distributed statistic.
>
> b. "Small-Sample Test for Comparing Two Population Means", with
>    T=(Y0 - Y1) % S * %: (%n0)+(%n1) as the t-distributed statistic.
>
> I believe what I want is a "Large-Sample Test for Comparing Two
> Population Means". (Large-Sample because I can run as many benchmarks
> as I like.)
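For what it's worth, a minimal J sketch of the large-sample comparison Roger is after might look something like the following. The verbs timings, mean, var and zstat are ad hoc names I've made up rather than library definitions, the timed sentence '+/ i. 1e6' is only a placeholder, and the two timing vectors are assumed to come from separate runs under the two interpreters; 6!:2 is the standard timing foreign. It uses the two sample variances separately rather than a pooled estimate.

NB. x timings (in seconds) of the sentence y; 6!:2 times a boxed sentence
timings =: 4 : '6!:2&> x # < y'

mean =: +/ % #                          NB. arithmetic mean
var  =: (+/@(*:@(] - mean))) % <:@#     NB. unbiased sample variance

NB. large-sample z statistic for the difference of two mean times:
NB. (mean x - mean y) divided by the square root of (var x % #x) + (var y % #y)
zstat =: 4 : '((mean x) - mean y) % %: ((var x) % # x) + (var y) % # y'

NB. example (run the timings under each interpreter, then compare):
NB.   t0 =. 200 timings '+/ i. 1e6'
NB.   t1 =. 200 timings '+/ i. 1e6'
NB.   t0 zstat t1

With a few hundred timings from each interpreter, an absolute z value above roughly 1.96 corresponds to a two-sided p below 0.05, subject to the caveat above about markedly non-normal timing distributions.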
