I think the raw values don't matter so much because there is some randomization involved? And the same random seed is used...
Your DefaultSimilarity runs look pretty stable: it's between 0.0% and 1.5% variation, which is about as good as it gets, for HighTerm.... LowTerm, I am guessing, is always noisy because those queries are so fast. A few of these measures at least are; I know IntNRQ is, in particular :)

On Fri, Aug 16, 2013 at 6:20 PM, Tom Burton-West <[email protected]> wrote:
> Hello,
>
> I'm trying to benchmark a change to BM25Similarity (LUCENE-5175) using
> luceneutil.
>
> I'm running this on a lightly loaded machine with a load average (top) of
> about 0.01 when the benchmark is not running.
>
> I made the following changes:
> 1) localrun.py: changed Competition(debug=True) to Competition(debug=False)
> 2) made the following changes to localconstants.py per Robert Muir's
> suggestion:
>    JAVA_COMMAND = 'java -server -Xms4g -Xmx4g'
>    SEARCH_NUM_THREADS = 1
> 3) for the BM25 tests, set SIMILARITY_DEFAULT='BM25Similarity'
> 4) for the BM25 tests, uncommented the following line in searchBench.py:
>    #verifyScores = False
>
> Attached is output from iter 19 of several runs.
>
> The first 4 runs show consistently that the modified version is somewhere
> between 6% and 8% slower on the tasks with the highest difference between
> trunk and patch.
> However, if you look at the baseline TaskQPS for HighTerm, for example, run
> 3 is about 55 and run 1 is about 88. So the difference for this task
> between different runs of the bench program is much higher than the
> difference between trunk and modified/patch within a run.
>
> Is this to be expected? Is there a reason I should believe the
> differences shown within a run reflect the true differences?
>
> Seeing this variability, I then switched DEFAULT_SIMILARITY back to
> "DefaultSimilarity". In this case trunk and my_modified should be
> exercising exactly the same code, since the only changes in the patch are
> the addition of a test case for BM25Similarity and a change to
> BM25Similarity.
>
> In this case the "modified" version varies from -6.2% difference from the
> base to +4.4% difference from the base for LowTerm.
> Comparing QPS for the base case for HighTerm between different runs, we can
> see it varies from about 21 for run 1 to 76 for run 3.
>
> Is this kind of variation between runs of the benchmark to be expected?
>
> Any suggestions about where to look to reduce the variation between runs?
>
> Tom
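For anyone reproducing the setup Tom describes, the edits amount to roughly the sketch below. This is a minimal, hedged example: only Competition(debug=False), JAVA_COMMAND, SEARCH_NUM_THREADS, SIMILARITY_DEFAULT and verifyScores are taken from the mail itself; the surrounding localrun.py structure, and the assumption that SIMILARITY_DEFAULT is overridden in localconstants.py, will vary by luceneutil checkout.

  # localconstants.py (per Robert Muir's suggestion quoted in the mail)
  JAVA_COMMAND = 'java -server -Xms4g -Xmx4g'   # fixed 4g heap, server VM
  SEARCH_NUM_THREADS = 1                        # single search thread
  SIMILARITY_DEFAULT = 'BM25Similarity'         # only for the BM25 runs (assumed to live here)

  # localrun.py (assumed structure; only the debug flag comes from the mail)
  import competition
  comp = competition.Competition(debug=False)   # debug=True is meant for quick sanity runs, not timing
  # ... index / competitor / benchmark setup unchanged ...

  # searchBench.py: uncomment this line for the BM25 runs, presumably so the
  # changed BM25 scores between trunk and patch don't fail score verification
  verifyScores = False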
