It looks like the attachment with the algorithm was missing from my last email,
so I have pasted the text here. Thanks in advance for any help.
#Start of the wikipedia-default.alg file
merge.factor=mrg:10:10:10
max.field.length=2147483647
#max.buffered=buf:10:10:100:100
ram.flush.mb=flush:16:16:16
compound=true
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
doc.stored=true
doc.tokenized=true
doc.term.vector=false
log.step=5000
docs.file=temp/enwiki-20070527-pages-articles.xml
content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
# task at this depth or less would print when they start
task.max.depth.log=2
log.queries=false
# -------------------------------------------------------------------------------------
{ "Rounds"
    ResetSystemErase
    { "Populate"
        CreateIndex
        { "MAddDocs" AddDoc > : 200000
        CloseIndex
    }
    NewRound
} : 3
RepSumByName
RepSumByPrefRound MAddDocs
#End of wikipedia-default.alg file
Thanks,
Sean
From: Sean Tong [mailto:[email protected]]
Sent: Sunday, December 11, 2011 11:54 PM
To: [email protected]
Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Hi,
We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I
have been running the benchmark tests that come with Lucene. To my surprise, I
found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the
Wikipedia data.
Attached is the algorithm for the tests. The tests used the default Lucene
settings for flush memory size and merge factor. 512M of memory was used for the
tasks. The test machine is a 64-bit Windows 7 machine with an Intel Core i7.
The command:
%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
Here are the test results:
Lucene 2.4.1
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000   1,609.1      124.29   89,218,496  241,631,232
[java] MAddDocs_200000      1  16.00   10       1      200000   1,746.4      114.52  102,365,864  241,762,304
[java] MAddDocs_200000      2  16.00   10       1      200000   1,566.8      127.65   69,428,144  174,194,688
Lucene 2.9.4
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
[java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
[java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600
Lucene 3.5.0
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    676.48      295.65   70,917,592  129,695,744
[java] MAddDocs_200000      1  16.00   10       1      200000    626.13      319.42   50,329,552   94,240,768
[java] MAddDocs_200000      2  16.00   10       1      200000    687.68      290.83   57,732,640   92,864,512
The indexing speed using 2.4.1 is roughly 2.3x to 2.8x the speed using 3.5.0,
depending on the round. Did I miss any settings or configurations?
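For anyone who wants to recheck the ratio, here is a small Python sketch that recomputes it from the rec/s column of the three reports. It is not part of the benchmark; the names REC_PER_SEC, mean_rate, and speedup are made up for this illustration, and averaging the three rounds is just one reasonable way to summarize them.

```python
from statistics import mean

# rec/s for the MAddDocs_200000 task, rounds 0 to 2, copied from the reports.
REC_PER_SEC = {
    "2.4.1": [1609.1, 1746.4, 1566.8],
    "2.9.4": [1046.49, 1165.35, 1245.86],
    "3.5.0": [676.48, 626.13, 687.68],
}


def mean_rate(version):
    """Mean indexing rate (docs per second) over the three rounds."""
    return mean(REC_PER_SEC[version])


def speedup(baseline, candidate):
    """How many times faster `baseline` is than `candidate`, by mean rec/s."""
    return mean_rate(baseline) / mean_rate(candidate)


if __name__ == "__main__":
    for version in REC_PER_SEC:
        print(f"{version}: {mean_rate(version):8.1f} rec/s, "
              f"{speedup(version, '3.5.0'):.2f}x the 3.5.0 rate")
```

Comparing per round instead of by mean only needs a zip over two of the lists.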
Thanks,
Sean