On Thu, Aug 20, 2009 at 1:21 PM, Michael G Schwern <schw...@pobox.com> wrote:

> Jim Cromie wrote:
> > What's notable in its absence is any *real* use of perl-dist's tests.
> > I dug into the code, and found that this works.
> >
> > $> HARNESS_TIMER=1 make test
> > ext/threads/t/end.............................................ok       60 ms
> > ext/threads/t/err.............................................ok     1090 ms
> > ext/threads/t/exit............................................ok      365 ms
> > ext/threads/t/free2...........................................ok     2367 ms
> > ext/threads/t/free............................................ok    12383 ms
> >
> > so clearly:
> > - somebody thought of this already ;-)
> > - does Test::* support this feature?
> > - t/harness doesn't, but HARNESS_OPTIONS could allow 't'
> > - it wouldn't be too hard to collect this data.
> > - it could be on by default for -Dusedevel
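
To partly answer my own question about Test::* support: as far as I can tell
the timer lives in TAP::Harness, and HARNESS_TIMER just flips it on.  An
untested sketch; the test list is hypothetical:

  # Sketch only: TAP::Harness (what prove uses) already has a timer knob;
  # this is roughly what HARNESS_TIMER=1 switches on.
  use strict;
  use warnings;
  use TAP::Harness;

  my $harness = TAP::Harness->new({
      timer     => 1,          # append elapsed time to each test's output line
      verbosity => 0,
      lib       => ['lib'],
  });
  $harness->runtests( glob('t/*.t') );

prove --timer appears to be the command-line spelling of the same thing; the
missing piece is recording the numbers somewhere instead of just printing them.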
>
> Here are the problems with this approach:
>
> A) Unless I'm mistaken, it's doing wallclock measurement.
> B) It's including time for the harness's own parsing.
> C) It's including time for Test::More / test.pl.
> D) The tests change.
>
> A is a known problem and we know how to fix it: record CPU and SYS time.
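
(For concreteness on A: perl's built-in times() already reports user and
system CPU seconds, so the fix could be as small as the untested sketch
below; do_some_work() is just a stand-in for running a test.)

  # Sketch: record CPU (user) and SYS time around a unit of work,
  # rather than wallclock.  times() is a perl builtin.
  use strict;
  use warnings;

  my ($u0, $s0) = times();
  do_some_work();                      # hypothetical stand-in for a test
  my ($u1, $s1) = times();
  printf "user: %.3fs  sys: %.3fs\n", $u1 - $u0, $s1 - $s0;

  sub do_some_work {
      my $x = 0;
      $x += $_ for 1 .. 1_000_000;
  }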
>
> B and C mean a change to Test::Harness or Test::More could throw the whole
> thing off.  Since they change all the time you can't compare the results from
> Perl version A with Perl version B which are any significant distance apart.
> Maybe Perl's performance changed, maybe Test::More's did.  There will be too
> many false positives to get useful results.
>
> B can probably be solved by just measuring the CPU and system time of the
> child process.
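
(A sketch of that, untested: the 3rd and 4th values returned by times() are
the accumulated user/system time of reaped children, so wrapping the test run
in the parent charges only the child's CPU, not the harness's own parsing.
The test file name below is made up.)

  # Sketch: measure CPU/SYS time of the child process only.
  use strict;
  use warnings;

  my (undef, undef, $cu0, $cs0) = times();
  system($^X, 't/some_test.t');        # hypothetical test script
  my (undef, undef, $cu1, $cs1) = times();
  printf "child user: %.3fs  child sys: %.3fs\n", $cu1 - $cu0, $cs1 - $cs0;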
>
> C is trickier and I have no good suggestions.
>
> D is not really possible to solve in the general case.  For any given version
> of Perl you've got different test code.  Before a benchmark could be compared
> you'd have to see if the test file changed between versions.  They will.  You
> could try freezing the test suite and using that as a benchmark suite but
> eventually stuff will break.
>
> There is information to be gotten out of just saving test performance
> snapshots, but it is not trivial to interpret.  Interpreted wrong it would
> lead to bad decisions.
>


I agree with all of that, more or less.
Low-quality data is easy to get; interpretation is hard.

But data plus statistical methods can still yield good info, at least plausibly.

I'd agree that "collect the data, and someone will analyse it"
isn't a particularly sophisticated strategy, but it's not just
a coin-toss either.

Were a collection scheme to be considered, it should really be able to
collect more details about each run, including info from separate tools,
but simple statistics on lots of data can probably control for A through D.
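
By "simple statistics" I mean no more than something like the untested sketch
below; the input format (one "testname seconds" pair per line, accumulated
over many smoke runs) is purely hypothetical:

  # Sketch: per-test mean and standard deviation over many recorded runs.
  use strict;
  use warnings;

  my %times;
  while (my $line = <STDIN>) {
      my ($test, $secs) = split ' ', $line;
      push @{ $times{$test} }, $secs;
  }

  for my $test (sort keys %times) {
      my @t    = @{ $times{$test} };
      my $n    = @t;
      my $mean = 0;
      $mean += $_ / $n for @t;
      my $var  = 0;
      $var += ($_ - $mean) ** 2 / $n for @t;
      printf "%-40s n=%-3d mean=%.3fs sd=%.3fs\n", $test, $n, $mean, sqrt($var);
  }

Over a few dozen configurations per smoke, the per-test spread would already
show whether a timing change is signal or just load noise.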

I'd also amplify the point about copious data by observing that smokers
regularly run 16-32 different configurations through the test suite.
Even cursory wallclock data has some value when taken on a machine that
isn't used for anything else and so has lower system-workload noise.
I now find myself wondering whether there's a notable difference in load
variance between a standard laptop and a smoking-only box.
