On Apr 7, 2006, at 10:05 AM, Doug Cutting wrote:
> Another axis that I don't think you're yet measuring is how things
> change as the index grows.

There are lots of axes that I haven't measured yet, but I do have to
move on to other things sooner or later. :) Running decent scientific
experiments is painstaking work. I have thrown away many, many hours
of invalid data. Besides, my laptop deserves a break. I've been
running it flat out for so long, it thinks it's a space heater.
> What happens with 10k, 100k and 1M and 10M documents?
There are 19043 docs in the extracted Reuters corpus, comprising a
little over 16 MB of content. I've set up the benchmarking apps so
that if you specify more than 19043 docs on the command line, they
loop back through the collection, allowing an arbitrarily large
number of docs to be indexed.
Here are results for 100,000 docs:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4            356 secs                       224 MB
KinoSearch / Perl 5.8.8     407 secs                       30 MB
... and 1 million docs:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4            4013 secs                      skipped
KinoSearch / Perl 5.8.8     4776 secs                      skipped
I'll try to get the 10 million benchmark done, but it's going to be a
little tough, as I'm running these on my main working laptop. I can
predict the trend, though, as we pile more and more docs on
KinoSearch 0.09_03.

At 19000 docs and a mem_threshold of 16 MB, the external sorting
algorithm flushes 6 times, so 6 sorted runs must be merged when the
postings sort pool gets sorted and the tis/tii/frq/prx files get
written. At 1 million docs and a mem_threshold of 16 MB, the flush
happens somewhere in the neighborhood of 300 times, so c. 300 sorted
runs must be merged.

At 10 million docs and a mem_threshold of 16 MB, we should see
something like 3000 sorted runs. That's probably too many, and if it
isn't, eventually it will be.
When the external sorter flips from feeding to fetching mode, it
establishes a buffer for each run. In KinoSearch 0.09_03, each buffer
is allowed to consume memory according to this formula:

    sortex->run_cache_limit = (sortex->mem_threshold / 2) / sortex->num_runs;

When there are 6 runs, each buffer can eat 1.33 MB, but when there
are 300 runs, each buffer only gets around 25 kB.
The primary problem with external sorting algorithms that use an
N-way merge is that they tend to be IO bound [1, 2]: between
recovering data from the sorted runs and writing to the final output,
the disk has to seek all over the place. Therefore, it's important
that each run's buffer be fairly generous, to minimize the number of
refills.

I'll wager that at 3000 runs and 2.5 kB buffers, the external sorter
is going to seize up. Fortunately there's an easy solution:

    Use more than 16 MB of RAM.
Earlier versions of KinoSearch used a different external sorter which
set a fixed buffer size of 32 kB of content. I'll fix 0.09 before the
official release to set a floor of 64 kB RAM usage for the buffers,
and sometime hence consider how to expose mem_threshold in the public
API.
> There are typically knees in search engine performance curves when
> indexes get substantially larger than RAM, and behaviour on either
> side of the knee may differ with different indexing strategies.
As far as KinoSearch is concerned, the index was substantially larger
than RAM when it was only 19000 docs, so we're well past the knee.
Arguably, these tests have been comparing apples and crabapples,
because I haven't forced KinoSearch to use more RAM. As mentioned
earlier, the rough analogue to maxMergeDocs is the RAM that the
in-memory sort pool is allowed to consume before a run must be
written to disk -- 16 MB, by default. I can set it higher, but that's
not part of the public API, so it seemed like cheating to do so.
Here's how things shake out when KinoSearch is hacked to use maximum
RAM...
19043 docs, body neither stored nor vectorized:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4            43.68 secs                     79 MB
KinoSearch / Perl 5.8.8     63.89 secs                     230 MB
19043 docs, body stored and vectorized:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
KinoSearch / Perl 5.8.8     69.12 secs                     236 MB
Lucene / JVM 1.4            71.96 secs                     118 MB
Perhaps in my desire to give Lucene every advantage, I've gone
overboard and hobbled KinoSearch too much. My main interest is in
comparing algorithms, and in retrospect it's a more realistic
comparison if I up KinoSearch's RAM, private API hack or no. However,
I knew that at least some of the people reading these benchmarking
reports would see "Cage Match! Lucene vs. KinoSearch!" and not pay
attention to the subtler, more interesting qualitative differences:
where one or the other does better and why. It seemed important to
take the highest high road so that no controversy could ever arise as
to whether one engine had been given a leg up.

I'm just not worried about the speed that KinoSearch is ultimately
going to attain. I'm intimately acquainted with where the bottlenecks
tend to occur in Perl, but KinoSearch is not a pure Perl library --
it's an extension to Perl written in Perl and C, with some of the
performance-critical work done by C. (Call it a "C library with a
Perl interface", if you like -- that's close enough.) This approach
has its drawbacks, but now that, after much experimentation and
effort, all or nearly all of the algorithmic issues have been solved,
speed isn't one of them. There are numerous optimizations still to be
done, so it's going to get faster.
The interesting thing to me about the unlimited-memory results above
is how the results change when there's stored and vectorized data. A
tricked-out Lucene 1.9.1 indexer seems to outperform a tricked-out
KinoSearch 0.09_03 indexer with regards to inverting documents.
However, when there's stored/vectorized data, Lucene appears to have
some overhead that KinoSearch doesn't. Corroborating data from
additional independent experiments would be nice to have, but I think
at this point we have enough to entertain the hypothesis that the
KinoSearch merge model, considered in isolation, is simply a more
efficient algorithm than the current Lucene merge model. I suspect
that were Lucene to adopt it, indexing speed would improve.
  * The stored field data and term vectors data are only written
    once, rather than shuffled around with each merge.
  * The serialized postings are plain old byte strings which can be
    compared using memcmp, so there's no OO overhead for the
    comparison routine.
  * The streams in an N-way merge are more predictable and therefore
    easier to optimize for IO efficiency.
  * External sorting is a well-studied problem and existing
    scholarship can be leveraged.
That's my two cents, and my contribution.
Thanks for the critique,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
[1] "Parallel Out-of-Core Sorting: The Third Way", Geeta Chaudhry.
    Section 8.1, "Merging based algorithms".
    http://www.cs.dartmouth.edu/reports/abstracts/TR2004-517/
[2] "Asynchronous Parallel Disk Sorting", Dementiev/Sanders.
    Section 2.2, "Multi-way merging".
    http://i10www.ira.uka.de/dementiev/files/DS03.pdf
RAW DATA - 100,000 docs
==================================================
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -docs 100000 -reps 6
---------------------------------------------------
1 Secs: 384.62 Docs: 100000
2 Secs: 359.83 Docs: 100000
3 Secs: 354.13 Docs: 100000
4 Secs: 354.63 Docs: 100000
5 Secs: 355.56 Docs: 100000
6 Secs: 354.18 Docs: 100000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 360.49 secs
Truncated mean (4 kept, 2 discarded): 356.05 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --docs=100000 --reps=6
------------------------------------------------------------
1 Secs: 406.87 Docs: 100000
2 Secs: 409.73 Docs: 100000
3 Secs: 405.33 Docs: 100000
4 Secs: 407.35 Docs: 100000
5 Secs: 406.76 Docs: 100000
6 Secs: 409.45 Docs: 100000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 407.58 secs
Truncated mean (4 kept, 2 discarded): 407.61 secs
------------------------------------------------------------
RAW DATA - 1,000,000 docs
==================================================
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --docs=1000000 --reps=6; cd ~/Desktop/ks/t/benchmarks/; java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -docs 1000000 -reps 6
------------------------------------------------------------
1 Secs: 4590.71 Docs: 1000000
2 Secs: 4694.36 Docs: 1000000
3 Secs: 4976.30 Docs: 1000000
4 Secs: 4760.99 Docs: 1000000
5 Secs: 4801.72 Docs: 1000000
6 Secs: 4834.97 Docs: 1000000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 4776.51 secs
Truncated mean (4 kept, 2 discarded): 4773.01 secs
------------------------------------------------------------
---------------------------------------------------
1 Secs: 4,023.33 Docs: 1000000
2 Secs: 4,003.18 Docs: 1000000
3 Secs: 4,012.21 Docs: 1000000
4 Secs: 4,015.97 Docs: 1000000
5 Secs: 4,010.61 Docs: 1000000
6 Secs: 4,013.39 Docs: 1000000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 4,013.12 secs
Truncated mean (4 kept, 2 discarded): 4,013.04 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$
RAW DATA -- KinoSearch large RAM
================================
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1 Secs: 64.86 Docs: 19043
2 Secs: 68.95 Docs: 19043
3 Secs: 67.98 Docs: 19043
4 Secs: 69.92 Docs: 19043
5 Secs: 71.56 Docs: 19043
6 Secs: 69.63 Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 68.82 secs
Truncated mean (4 kept, 2 discarded): 69.12 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1 Secs: 60.00 Docs: 19043
2 Secs: 63.35 Docs: 19043
3 Secs: 66.77 Docs: 19043
4 Secs: 63.82 Docs: 19043
5 Secs: 64.04 Docs: 19043
6 Secs: 64.37 Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 63.72 secs
Truncated mean (4 kept, 2 discarded): 63.89 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$