Kip Murray and Brian Schott have already commented usefully.
Here are some further comments which might help, or might further
confuse!

You yourself, Roger, have often pointed out to those of us
who have been agonising over fine-tuning some verb or
set of verbs that there's little point in fussing over small
improvements in performance.

Years ago, my assistant was very chuffed with himself for gently
pointing out to some food scientists that there was little use in
showing, with a very high level of statistical significance, that one
process for baking bread yielded a minutely higher level of some
nutrient than an alternative process did.

You will presumably be interested in observing or achieving some
useful level of improvement or difference, and I suspect that that
level of improvement matters more to you than an objective, fairly
academic and formal judgment of whether a statistically significant
difference has been observed.

If you know, a priori, that you wish to demonstrate a certain
measure of difference (for example, that the percentage linear
difference in run-time should be greater than 50%) with a certain
degree of confidence (p < 0.05, say), you might look into discussions
of "Statistical Power".

FWIW, medical and nutrition journals expect results to be presented
with confidence intervals of a given coverage probability, typically
95% or 99%.
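
For instance (again only a sketch, assuming t0 and t1 are plain vectors
of repeated timings of the same expression under the two interpreters;
the verb names are mine), a large-sample 95% interval for the difference
in mean run times could be computed along these lines:

   mean =: +/ % #
   var  =: (+/@(*:@(- mean))) % <:@#        NB. unbiased sample variance
   ci95 =: 4 : 0
d  =. (mean x) - mean y                     NB. difference in mean times
se =. %: ((var x) % # x) + (var y) % # y    NB. standard error of d
d + 1.96 * _1 1 * se                        NB. lower , upper bound
)

   t0 ci95 t1

An interval that excludes 0 corresponds to a two-sided test significant
at the 5% level, but it also shows how big the difference actually is.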

Something else worth bearing in mind is that it can be useful to
examine the actual distributions of what you're measuring.  If the
empirical probability curve is markedly non-normal, then both
Student's t and the "large-sample" z-statistic are somewhat
compromised: still useful, but you're less well assured that the
tabulated probabilities are appropriate.
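
As a small illustration of what I mean (the repetition count and the
timed sentence below are placeholders of my own choosing), you can
collect a batch of raw timings with the 6!:2 foreign and look at the
order statistics before trusting a normal approximation:

   times  =: 6!:2@> 200 # < 'i. 1e6'    NB. 200 raw timings of a sample sentence
   sorted =: /:~ times
   sorted {~ <. 0 0.25 0.5 0.75 1 * <: # sorted   NB. five-number summary

A long right tail or marked skew in those five numbers is a warning that
the tabulated t or z probabilities may flatter the comparison.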

I'd have thought that the applied side of Computer Science would
have well-developed approaches to comparing performance measurements
by now, or are they kept quiet for commercial reasons?

Mike

On 20/10/2011 9:04 PM, Roger Hui wrote:
> This message is addressed to Forum members who are knowledgeable in 
> statistics.
>
> The objective is to test whether the same expression is faster,
> slower, or takes the same amount of time, on the two different
> versions of the interpreter.  We know that due to vagaries of the
> operating system, the way interpreters are built (in particular the
> memory usage), the phase of the moon, ... the same expression will run
> in different times.  Are the times "the same"?
>
> From stat courses taken long ago and from consulting ancient stats
> texts, I get the idea that the following may be applicable:
>
> a. "Large-Sample Test" on the mean running time, with Z=(theta -
> theta0)%s_theta0 as the normally distributed statistic.
>
> b. "Small-Sample Test for Comparing Two Population Means", with T=(Y0
> - Y1) % S * %: (%n0)+(%n1) as the t-distributed statistic.
>
> I believe what I want is a "Large-Sample Test for Comparing Two
> Population Means".  (Large-Sample because I can run as many benchmarks
> as I like.)
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
