On Apr 7, 2006, at 10:05 AM, Doug Cutting wrote:
> Another axis that I don't think you're yet measuring is how things
> change as the index grows.

There are lots of axes that I haven't measured yet, but I do have to
move on to other things sooner or later. :) Running decent scientific
experiments is painstaking work. I have thrown away many, many hours
of invalid data. Besides, my laptop deserves a break. I've been
running it flat out for so long, it thinks it's a space heater.
> What happens with 10k, 100k and 1M and 10M documents?
There are 19043 docs in the extracted Reuters corpus, comprising a
little over 16 MB of content. I've set up the benchmarking apps so
that if you specify more than 19043 docs on the command line, they
loop back through the collection, allowing an arbitrarily large
number of docs to be indexed.
Here are results for 100,000 docs:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4            356 secs                       224 MB
KinoSearch / Perl 5.8.8     407 secs                       30 MB
... and 1 million docs:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4            4013 secs                      skipped
KinoSearch / Perl 5.8.8     4776 secs                      skipped
I'll try to get the 10 million benchmark done, but it's going to be a
little tough, as I'm running these on my main working laptop. I can
predict the trend, though, as we pile more and more docs on
KinoSearch 0.09_03.

At 19000 docs and a mem_threshold of 16 MB, the external sorting
algorithm flushes 6 times, so 6 sorted runs must be merged when the
postings sort pool gets sorted and the tis/tii/frq/prx files get
written. At 1 million docs and a mem_threshold of 16 MB, the flush
happens somewhere in the neighborhood of 300 times, so c. 300 sorted
runs must be merged.

At 10 million docs and a mem_threshold of 16 MB, we should see
something like 3000 sorted runs. That's probably too many, and if it
isn't, eventually it will be.
When the external sorter flips from feeding to fetching mode, it
establishes a buffer for each run. In KinoSearch 0.09_03, each buffer
is allowed to consume memory according to this formula:

    sortex->run_cache_limit = (sortex->mem_threshold / 2) / sortex->num_runs;

When there are 6 runs, each buffer can eat 1.33 MB, but when there
are 300 runs, each buffer only gets around 25 kB.
The primary problem with external sorting algorithms that use an
N-way merge is that they tend to be IO bound [1, 2]: between
recovering data from the sorted runs and writing to the final output,
the disk has to seek all over the place. Therefore, it's important
that each run's buffer be fairly generous, to minimize the number of
refills.

I'll wager that at 3000 runs and 2.5 kB buffers, the external sorter
is going to seize up. Fortunately there's an easy solution:

    Use more than 16 MB of RAM.
Earlier versions of KinoSearch used a different external sorter which
set a fixed buffer size of 32 kB of content. I'll fix 0.09 before the
official release to set a floor of 64 kB RAM usage for the buffers,
and sometime hence consider how to expose mem_threshold in the public
API.
> There are typically knees in search engine performance curves when
> indexes get substantially larger than RAM, and behaviour on either
> side of the knee may differ with different indexing strategies.
As far as KinoSearch is concerned, the index was substantially larger
than RAM when it was only 19000 docs, so we're well past the knee.
Arguably, these tests have been comparing apples and crabapples,
because I haven't forced KinoSearch to use more RAM. As mentioned
earlier, the rough analogue to maxMergeDocs is the RAM that the
in-memory sort pool is allowed to consume before a run must be
written to disk -- 16 MB, by default. I can set it higher, but that's
not part of the public API, so it seemed like cheating to do so.
Here's how things shake out when KinoSearch is hacked to use maximum
RAM...
19043 docs, body neither stored nor vectorized:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4            43.68 secs                     79 MB
KinoSearch / Perl 5.8.8     63.89 secs                     230 MB
19043 docs, body stored and vectorized:

config                      truncated mean secs (6 reps)   max memory (1 rep)
--------------------------------------------------------------------------
KinoSearch / Perl 5.8.8     69.12 secs                     236 MB
Lucene / JVM 1.4            71.96 secs                     118 MB
Perhaps in my desire to give Lucene every advantage, I've gone
overboard and hobbled KinoSearch too much. My main interest is in
comparing algorithms, and in retrospect it's a more realistic
comparison if I up KinoSearch's RAM, private API hack or no. However,
I knew that at least some of the people reading these benchmarking
reports would see "Cage Match! Lucene vs. KinoSearch!" and not pay
attention to the subtler, more interesting qualitative differences:
where one or the other does better and why. It seemed important to
take the highest high road so that no controversy could ever arise as
to whether one engine had been given a leg up.

I'm just not worried about the speed that KinoSearch is ultimately
going to attain. I'm intimately acquainted with where the bottlenecks
tend to occur in Perl, but KinoSearch is not a pure Perl library --
it's an extension to Perl written in Perl and C, with some of the
performance-critical work done by C. (Call it a "C library with a
Perl interface", if you like -- that's close enough.) This approach
has its drawbacks, but now that, after much experimentation and
effort, all or nearly all of the algorithmic issues have been solved,
speed isn't one of them. There are numerous optimizations still to be
done, so it's going to get faster.
The interesting thing to me about the unlimited-memory results above
is how the results change when there's stored and vectorized data. A
tricked-out Lucene 1.9.1 indexer seems to outperform a tricked-out
KinoSearch 0.09_03 indexer with regards to inverting documents.
However, when there's stored/vectorized data, Lucene appears to have
some overhead that KinoSearch doesn't. Corroborating data from
additional independent experiments would be nice to have, but I think
at this point we have enough to entertain the hypothesis that the
KinoSearch merge model, considered in isolation, is simply a more
efficient algorithm than the current Lucene merge model. I suspect
that were Lucene to adopt it, indexing speed would improve.
  * The stored field data and term vectors data are only written
    once, rather than shuffled around with each merge.
  * The serialized postings are plain old byte strings which can be
    compared using memcmp, so there's no OO overhead for the
    comparison routine.
  * The streams in an N-way merge are more predictable and therefore
    easier to optimize for IO efficiency.
  * External sorting is a well-studied problem and existing
    scholarship can be leveraged.
That's my two cents, and my contribution.
Thanks for the critique,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
[1] "Parallel Out-of-Core Sorting: The Third Way", Geeta Chaudhry.
    Section 8.1, "Merging based algorithms".
    http://www.cs.dartmouth.edu/reports/abstracts/TR2004-517/
[2] "Asynchronous Parallel Disk Sorting", Dementiev/Sanders.
    Section 2.2, "Multi-way merging".
    http://i10www.ira.uka.de/dementiev/files/DS03.pdf
RAW DATA - 100,000 docs
==================================================
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -docs 100000 -reps 6
---------------------------------------------------
1 Secs: 384.62 Docs: 100000
2 Secs: 359.83 Docs: 100000
3 Secs: 354.13 Docs: 100000
4 Secs: 354.63 Docs: 100000
5 Secs: 355.56 Docs: 100000
6 Secs: 354.18 Docs: 100000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 360.49 secs
Truncated mean (4 kept, 2 discarded): 356.05 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --docs=100000 --reps=6
------------------------------------------------------------
1 Secs: 406.87 Docs: 100000
2 Secs: 409.73 Docs: 100000
3 Secs: 405.33 Docs: 100000
4 Secs: 407.35 Docs: 100000
5 Secs: 406.76 Docs: 100000
6 Secs: 409.45 Docs: 100000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 407.58 secs
Truncated mean (4 kept, 2 discarded): 407.61 secs
------------------------------------------------------------
RAW DATA - 1,000,000 docs
==================================================
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --docs=1000000 --reps=6; cd ~/Desktop/ks/t/benchmarks/; java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -docs 1000000 -reps 6
------------------------------------------------------------
1 Secs: 4590.71 Docs: 1000000
2 Secs: 4694.36 Docs: 1000000
3 Secs: 4976.30 Docs: 1000000
4 Secs: 4760.99 Docs: 1000000
5 Secs: 4801.72 Docs: 1000000
6 Secs: 4834.97 Docs: 1000000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 4776.51 secs
Truncated mean (4 kept, 2 discarded): 4773.01 secs
------------------------------------------------------------
---------------------------------------------------
1 Secs: 4,023.33 Docs: 1000000
2 Secs: 4,003.18 Docs: 1000000
3 Secs: 4,012.21 Docs: 1000000
4 Secs: 4,015.97 Docs: 1000000
5 Secs: 4,010.61 Docs: 1000000
6 Secs: 4,013.39 Docs: 1000000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 4,013.12 secs
Truncated mean (4 kept, 2 discarded): 4,013.04 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$
RAW DATA -- KinoSearch large RAM
================================
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1 Secs: 64.86 Docs: 19043
2 Secs: 68.95 Docs: 19043
3 Secs: 67.98 Docs: 19043
4 Secs: 69.92 Docs: 19043
5 Secs: 71.56 Docs: 19043
6 Secs: 69.63 Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 68.82 secs
Truncated mean (4 kept, 2 discarded): 69.12 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1 Secs: 60.00 Docs: 19043
2 Secs: 63.35 Docs: 19043
3 Secs: 66.77 Docs: 19043
4 Secs: 63.82 Docs: 19043
5 Secs: 64.04 Docs: 19043
6 Secs: 64.37 Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 63.72 secs
Truncated mean (4 kept, 2 discarded): 63.89 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$