Thanks Simon for your response.
I just re-ran the 3.5 benchmark with the ClassicAnalyzer. Here are the results:
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun   rec/s  elapsedSec  avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000  715.76      279.42  48,828,144  128,057,344
[java] MAddDocs_200000      1  16.00   10       1      200000  679.04      294.53  68,321,424   85,721,088
[java] MAddDocs_200000      2  16.00   10       1      200000  761.95      262.49  63,139,256   91,881,472
Performance is slightly better than with StandardAnalyzer (about 680 to 760
rec/s versus 630 to 690 rec/s), but it is still much worse than with 2.4.1
(about 1,570 to 1,750 rec/s).
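For reference, here is a minimal sketch of how the runs can be compared. It averages the rec/s values copied from the reports in this thread and prints how 2.4.1 compares with each configuration. The dictionary labels and the helper name mean_rate are made up for illustration and are not part of the Lucene benchmark package; I am also assuming the 2.4.1 and 2.9.4 runs used StandardAnalyzer, as set in the .alg file.

```python
from statistics import mean

# rec/s per round, copied from the MAddDocs reports in this thread.
# Labels are illustrative; the analyzer for 2.4.1/2.9.4 is assumed
# to be the StandardAnalyzer configured in wikipedia-default.alg.
rates = {
    "2.4.1 StandardAnalyzer": [1609.1, 1746.4, 1566.8],
    "2.9.4 StandardAnalyzer": [1046.49, 1165.35, 1245.86],
    "3.5.0 StandardAnalyzer": [676.48, 626.13, 687.68],
    "3.5.0 ClassicAnalyzer": [715.76, 679.04, 761.95],
}


def mean_rate(label):
    """Average rec/s over the three rounds for one configuration."""
    return mean(rates[label])


baseline = mean_rate("2.4.1 StandardAnalyzer")

for label in rates:
    avg = mean_rate(label)
    # Ratio > 1 means 2.4.1 indexed faster than this configuration.
    print(f"{label}: {avg:.1f} rec/s (2.4.1 is {baseline / avg:.2f}x as fast)")
```

With these numbers, 2.4.1 comes out about 2.5x as fast as 3.5.0 with StandardAnalyzer and about 2.3x as fast as 3.5.0 with ClassicAnalyzer.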
Sean
-----Original Message-----
From: Simon Willnauer [mailto:[email protected]]
Sent: Monday, December 12, 2011 12:20 PM
To: [email protected]
Subject: Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
hey,
Can you try using the ClassicAnalyzer instead of StandardAnalyzer in 3.5?
In 3.5, StandardAnalyzer is a different implementation than in 2.9 and 2.4.
Alternatively, rerun the 2.4 benchmarks with a WhitespaceAnalyzer, just for
comparison.
simon
On Mon, Dec 12, 2011 at 7:08 PM, Sean Tong <[email protected]> wrote:
> It looks like the attachment with the algorithm was missing from my last
> email, so I have pasted the text here. Thanks in advance for any help.
>
> #Start of the wikipedia-default.alg file
>
> merge.factor=mrg:10:10:10
> max.field.length=2147483647
> #max.buffered=buf:10:10:100:100
> ram.flush.mb=flush:16:16:16
>
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> log.step=5000
>
> docs.file=temp/enwiki-20070527-pages-articles.xml
>
> content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
>
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=false
> # -------------------------------------------------------------------------------------
>
> { "Rounds"
>     ResetSystemErase
>     { "Populate"
>         CreateIndex
>         { "MAddDocs" AddDoc > : 200000
>         CloseIndex
>     }
>     NewRound
> } : 3
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
> #End of wikipedia-default.alg file
>
> Thanks,
>
> Sean
>
>
> From: Sean Tong [mailto:[email protected]]
> Sent: Sunday, December 11, 2011 11:54 PM
> To: [email protected]
> Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>
> Hi,
>
> We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0.
> I have been running the benchmark tests that come with Lucene. To my surprise,
> I found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the
> Wikipedia data.
>
> Attached is the algorithm for the tests. The tests used the default Lucene
> settings for flush memory size and merge factor, and the tasks ran with
> 512M of memory. The test machine is a 64-bit Windows 7 machine with an
> Intel Core i7.
>
> The command:
> %ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
>
> Here are the test results:
>
> Lucene 2.4.1
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun    rec/s  elapsedSec   avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000  1,609.1      124.29   89,218,496  241,631,232
> [java] MAddDocs_200000      1  16.00   10       1      200000  1,746.4      114.52  102,365,864  241,762,304
> [java] MAddDocs_200000      2  16.00   10       1      200000  1,566.8      127.65   69,428,144  174,194,688
>
>
> Lucene 2.9.4
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
> [java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
> [java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600
>
> Lucene 3.5.0
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun   rec/s  elapsedSec  avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000  676.48      295.65  70,917,592  129,695,744
> [java] MAddDocs_200000      1  16.00   10       1      200000  626.13      319.42  50,329,552   94,240,768
> [java] MAddDocs_200000      2  16.00   10       1      200000  687.68      290.83  57,732,640   92,864,512
>
>
> Indexing with 2.4.1 is about 2.3x to 2.8x as fast as with 3.5.0 across the
> three rounds. Did I miss any settings or configurations?
>
> Thanks,
>
> Sean
>
>