Hello,
I'm trying to benchmark a change to BM25Similarity (LUCENE-5175) using
luceneutil.
I'm running this on a lightly loaded machine with a load average (top) of
about 0.01 when the benchmark is not running.
I made the following changes:
1) localrun.py changed Competition(debug=True) to Competition(debug=False)
2) made the following changes to localconstants.py per Robert Muir's
suggestion:
JAVA_COMMAND = 'java -server -Xms4g -Xmx4g'
SEARCH_NUM_THREADS = 1
3) for the BM25 tests set SIMILARITY_DEFAULT='BM25Similarity'
4) for the BM25 tests uncommented the following line in searchBench.py:
#verifyScores = False
Attached below is output from iter 19 of several runs.
The first four runs consistently show that the modified version is somewhere
between 6% and 8% slower on the tasks with the largest difference between
trunk and patch.
However, if you look at the baseline Task QPS for HighTerm, for example,
run 3 is about 55 while run 1 is about 88. So for this task, the difference
between separate runs of the benchmark program is much larger than the
difference between trunk and modified/patch within a single run.
Is this to be expected? Is there a reason I should believe the
differences shown within a run reflect the true differences?
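As a rough sanity check of my own (not something luceneutil does), I looked at whether the reported one-standard-deviation intervals for baseline and modified even overlap, using the HighTerm numbers from run 1 below:

```python
# Rough check: do the baseline and modified QPS intervals overlap?
# Numbers are HighTerm from BM25SimRun1 below: mean QPS, with StdDev
# reported as a percentage of the mean, as in the luceneutil output.

def interval(mean_qps, stddev_pct):
    """Return (low, high) as mean +/- one reported standard deviation."""
    delta = mean_qps * stddev_pct / 100.0
    return mean_qps - delta, mean_qps + delta

base_lo, base_hi = interval(87.91, 13.2)   # trunk baseline
mod_lo, mod_hi = interval(81.02, 8.5)      # patched version

# If the one-sigma intervals overlap, a -7.8% mean difference is hard to
# distinguish from noise on the strength of a single run.
overlap = max(base_lo, mod_lo) <= min(base_hi, mod_hi)
print(f"baseline [{base_lo:.1f}, {base_hi:.1f}], "
      f"modified [{mod_lo:.1f}, {mod_hi:.1f}], overlap={overlap}")
```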
Seeing this variability, I then switched SIMILARITY_DEFAULT back to
"DefaultSimilarity". In this case trunk and my_modified should be
exercising exactly the same code, since the only changes in the patch are
the addition of a test case for BM25Similarity and a change to
BM25Similarity itself.
In this case the "modified" version varies from -6.2% to +4.4% relative to
the base for LowTerm. Comparing baseline QPS for HighTerm between different
runs, it varies from about 21 in run 1 to 76 in run 3.
Is this kind of variation between runs of the benchmark to be expected?
Any suggestions about where to look to reduce the variations between runs?
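To put a number on the between-run spread, here is a quick calculation of my own (outside luceneutil) of the coefficient of variation of the baseline HighTerm QPS across the three BM25 runs that report it below (run 4's top tasks don't include HighTerm):

```python
import statistics

# Baseline HighTerm QPS from BM25SimRun1..3 below.
highterm_qps = [87.91, 62.15, 54.85]

mean = statistics.mean(highterm_qps)
stdev = statistics.stdev(highterm_qps)   # sample standard deviation
cv = stdev / mean                        # coefficient of variation

# A between-run CV around 25% dwarfs the ~7% within-run difference,
# which is exactly the concern raised above.
print(f"mean={mean:.1f} QPS, stdev={stdev:.1f}, CV={cv:.1%}")
```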
Tom
BM25Similarity runs where "my_modified_version" is LUCENE-
tail -33 BM25SimRun1 | head -5
Report after iter 19:
            Task    QPS baseline   StdDev    QPS my_modified_version   StdDev      Pct diff
        HighTerm           87.91  (13.2%)                      81.02   (8.5%)    -7.8% ( -26% -  16%)
         MedTerm          111.81  (13.2%)                     103.11   (8.4%)    -7.8% ( -25% -  15%)
         LowTerm          411.44  (17.7%)                     382.47  (14.5%)    -7.0% ( -33% -  30%)
[tburtonw@alamo runs]$ tail -33 BM25SimRun2 | head -5
Report after iter 19:
            Task    QPS baseline   StdDev    QPS my_modified_version   StdDev      Pct diff
        HighTerm           62.15   (6.4%)                      58.10   (7.1%)    -6.5% ( -18% -   7%)
         MedTerm          139.11   (4.5%)                     130.22   (7.5%)    -6.4% ( -17% -   5%)
         LowTerm          391.93  (10.5%)                     373.71  (13.1%)    -4.6% ( -25% -  21%)
[tburtonw@alamo runs]$ tail -33 BM25SimRun3 | head -5
Report after iter 19:
            Task    QPS baseline   StdDev    QPS my_modified_version   StdDev      Pct diff
        HighTerm           54.85   (6.5%)                      50.18   (1.6%)    -8.5% ( -15% -   0%)
         MedTerm          146.04   (8.6%)                     137.31   (4.7%)    -6.0% ( -17% -   8%)
    OrNotHighLow           45.85  (11.1%)                      43.37  (10.6%)    -5.4% ( -24% -  18%)
[tburtonw@alamo runs]$ tail -33 BM25SimRun4 | head -5
Report after iter 19:
            Task    QPS baseline   StdDev    QPS my_modified_version   StdDev      Pct diff
    OrNotHighMed           49.40   (8.7%)                      45.37   (8.8%)    -8.2% ( -23% -  10%)
    OrNotHighLow           65.48   (8.7%)                      60.19   (9.0%)    -8.1% ( -23% -  10%)
   OrNotHighHigh           37.06   (8.2%)                      34.18   (8.2%)    -7.8% ( -22% -   9%)
==================================================================================================================
Default similarity, which is not modified by the BM25 patch
DefaultSimRun1
         LowTerm          398.97  (17.9%)                     398.94  (18.1%)    -0.0% ( -30% -  43%)
        HighTerm           21.13  (12.1%)                      21.45  (12.2%)     1.5% ( -20% -  29%)
DefaultSimRun2
         LowTerm          406.93  (17.1%)                     381.51  (15.8%)    -6.2% ( -33% -  32%)
        HighTerm           59.21   (2.5%)                      59.70   (3.5%)     0.8% (  -5% -   7%)
DefaultSimRun3
         LowTerm          431.59  (18.5%)                     450.55  (16.8%)     4.4% ( -26% -  48%)
        HighTerm           76.45   (2.0%)                      76.45   (1.7%)     0.0% (  -3% -   3%)