Re: Benchmarking results

Marvin Humphrey Thu, 06 Apr 2006 17:26:15 -0700


On Apr 4, 2006, at 10:23 AM, Tatu Saloranta wrote:

So in this case, what would give more comparable results (assuming
you are interested in measuring likely server-side
usage scenario, which is usually what Lucene is used for)

My main interest with these tests is algorithmic performance. How much time it takes to start up or warm up a JVM isn't something I want to be measuring. There are startup issues I'm concerned about, but they mostly relate to file format design. The load time for field norms is a significant concern. So is the IndexInterval, which is set to 1024 by default instead of 128 as in Lucene. So is the locality of reference issue for where the term vector data gets stored. All of those things affect the total time it takes for a KinoSearch app to launch, load, search, and return results, which needs to be as small as possible so that e.g. website search apps indexing up to [some large number of] documents can be run as simple CGI scripts. I'm considering further modifications to the file format to keep that total time down...

Actually, I think the benchmark results illustrate that everyone should be at least mildly concerned about where the Term Vector data gets stored. KinoSearch only writes that data once. Lucene, however, has to read/write that data during each merge, and the more streams you have, the more complex the merge. It stands to reason that storing term vector data with the stored fields data would speed up the merge process.

I brought this issue up a few weeks ago, but in a search-time context. The two primary applications for Term Vector data that I am aware of are excerpting/highlighting and "more like this" searches, both of which would benefit from having the term vectors stored with the documents, because each search would require fewer disk seeks. Term Vectors might also be used to build a pure vector space search engine, like the one described in this article <http://www.perl.com/pub/a/2003/02/19/engine.html>, but that's impractical for indexes larger than a handful of documents and of academic interest only. Are there any other significant applications? If not, I submit that term vectors belong in the .fdx file.

would be to run all runs within same JVM / execution (for Perl),

Thanks for the critique. I've updated the indexer apps to accept two command line arguments. They're now run like so:


    java [ARGS] LuceneIndexer -reps 6 -docs 1000
    perl indexers/kinosearch_indexer.plx --reps=6 --docs=1000

With the new methodology, the numbers are slightly better for Lucene. They're actually worse for KinoSearch. I've isolated the code that's responsible for the slowdown, and I speculate that it's a memory fragmentation issue, as I can solve it by forcing KinoSearch to consume more memory at that point. However, having established that KinoSearch is in Lucene's league with regard to indexing speed, I'm not worried about absolute numbers, and the new benchmarker interface is slightly more stable, allowing more accurate comparative analysis of algorithmic efficiency. The trends are still apparent: KinoSearch gains ground when there's stored and vectorized content.
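For readers who want to picture the "all reps in one process" methodology, here is a minimal sketch in Java. It is not the actual LuceneIndexer source: the class name `RepeatedRunSketch`, the helpers `intOption` and `timeReps`, and the placeholder workload are all assumptions made for illustration. Only the `-reps` and `-docs` flags come from the commands shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an in-process benchmark loop; not the real LuceneIndexer. */
class RepeatedRunSketch {

    /** Reads an integer option such as "-reps 6"; returns fallback if the flag is absent. */
    static int intOption(String[] args, String flag, int fallback) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals(flag)) {
                return Integer.parseInt(args[i + 1]);
            }
        }
        return fallback;
    }

    /** Runs the task reps times inside this JVM and returns elapsed seconds per rep. */
    static List<Double> timeReps(int reps, Runnable task) {
        List<Double> secs = new ArrayList<>();
        for (int rep = 0; rep < reps; rep++) {
            long start = System.nanoTime();
            task.run();
            secs.add((System.nanoTime() - start) / 1e9);
        }
        return secs;
    }

    public static void main(String[] args) {
        int reps = intOption(args, "-reps", 1);
        int docs = intOption(args, "-docs", 1000);

        // Placeholder workload standing in for "index this many documents".
        Runnable fakeIndexing = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < docs; i++) {
                sb.append(i);
            }
        };

        List<Double> secs = timeReps(reps, fakeIndexing);
        for (int i = 0; i < secs.size(); i++) {
            System.out.printf("%d   Secs: %.2f  Docs: %d%n", i + 1, secs.get(i), docs);
        }
    }
}
```

Because every repetition runs in the same JVM, later reps benefit from whatever the JIT compiled during earlier ones, which is the reason for repeating in-process rather than relaunching.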


Raw data is below.

and either take the fastest runs, or discard the first one and take median or
average.

As you'll see in the raw data, the apps now produce two aggregate numbers: a mean, and a truncated mean <http://en.wikipedia.org/wiki/Truncated_mean>.
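The two aggregates can be sketched as follows. This is an illustration, not the benchmark apps' own code: the class and method names are hypothetical, and the discard rule (drop the single fastest and single slowest of six runs) is an assumption inferred from the "4 kept, 2 discarded" line in the raw data.

```java
import java.util.Arrays;

/** Sketch of the two aggregates; the discard rule is an assumption, see above. */
class TruncatedMeanSketch {

    /** Plain arithmetic mean. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Drops discardPerEnd values from each end of the sorted sample, averages the rest. */
    static double truncatedMean(double[] xs, int discardPerEnd) {
        if (discardPerEnd < 0 || 2 * discardPerEnd >= xs.length) {
            throw new IllegalArgumentException("nothing left after discarding");
        }
        double[] sorted = xs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        double[] kept = Arrays.copyOfRange(sorted, discardPerEnd, sorted.length - discardPerEnd);
        return mean(kept);
    }

    public static void main(String[] args) {
        // Timings from the first Lucene run in the raw data.
        double[] secs = {87.02, 84.56, 85.04, 83.83, 84.75, 84.84};
        System.out.printf("Mean: %.2f secs%n", mean(secs));
        System.out.printf("Truncated mean (4 kept, 2 discarded): %.2f secs%n",
                truncatedMean(secs, 1));
    }
}
```

Applied to the first Lucene run, this rule agrees with the reported figures (85.01 and 84.80 secs), which suggests the assumption about the discard rule is at least consistent with the data.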

ps. Regarding memory usage: it is also quite tricky to measure
 reliably, since Garbage Collection only kicks in when it has to...
 so Java uses as much memory as it can (without expanding heap)...
 plus, JVMs do not necessarily (or even usually) return unused
 chunks later on.

Yes. Still, there is a correlation between maxBufferedDocs and max memory consumption by the process. So Java must be reusing something...


    maxBufferedDocs   max memory (1 rep)   truncated mean time (6 reps)
    -------------------------------------------------------------------
        10                69 MB                124.89 secs
       100                91 MB                 88.17 secs
      1000               169 MB                 84.80 secs

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RAW DATA - JVM warmup / truncated mean experiment
===================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6

---------------------------------------------------
1   Secs: 87.02  Docs: 19043
2   Secs: 84.56  Docs: 19043
3   Secs: 85.04  Docs: 19043
4   Secs: 83.83  Docs: 19043
5   Secs: 84.75  Docs: 19043
6   Secs: 84.84  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 85.01 secs
Truncated mean (4 kept, 2 discarded): 84.80 secs
---------------------------------------------------

slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps 6

------------------------------------------------------------
1    Secs: 75.51  Docs: 19043
2    Secs: 80.79  Docs: 19043
3    Secs: 81.12  Docs: 19043
4    Secs: 84.68  Docs: 19043
5    Secs: 81.78  Docs: 19043
6    Secs: 79.65  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.5.0 Power Macintosh
Mean: 80.59 secs
Truncated mean (4 kept, 2 discarded): 80.83 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$


RAW DATA - mergefactor experiment
============================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6

---------------------------------------------------
1   Secs: 127.05  Docs: 19043
2   Secs: 125.50  Docs: 19043
3   Secs: 125.44  Docs: 19043
4   Secs: 124.53  Docs: 19043
5   Secs: 124.10  Docs: 19043
6   Secs: 121.57  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 124.70 secs
Truncated mean (4 kept, 2 discarded): 124.89 secs
---------------------------------------------------

slothbear:~/Desktop/ks/t/benchmarks marvin$ vim indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ javac -d . indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6

---------------------------------------------------
1   Secs: 89.91  Docs: 19043
2   Secs: 87.59  Docs: 19043
3   Secs: 88.51  Docs: 19043
4   Secs: 88.59  Docs: 19043
5   Secs: 87.97  Docs: 19043
6   Secs: 86.75  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 88.22 secs
Truncated mean (4 kept, 2 discarded): 88.17 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$
