Hi Robert,
On Mar 15, 2013, at 11:29 AM, Robert Muir <[email protected]> wrote:
> 2013/2/28 Steve Rowe <[email protected]>:
>> EnglishAnalyzer has used PorterStemmer instead of the English Snowball
>> stemmer since it was created in 2010 as part of LUCENE-2055[2]. I think
>> this is an oversight: EnglishAnalyzer should incorporate the best English
>> stemmer we've got, and Martin Porter says the Porter2 stemmer is better[1].
>> Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you
>> think?
>
> This was intentional actually. The default was a tradeoff of
> "benefits" (which affect less than 5% of english vocabulary, if you
> read around the snowball site), versus a much more significant
> performance difference as a "default".
>
> For example, when I did tests of indexing both short and long texts:
>
> http://find.searchhub.org/document/c1d3301b71dab5ca#46a8351089a98aec
>
> That's overall indexing speed, not just text analysis.
>
> It might be that this guy is faster these days (we've done some
> improvements) too.
Thanks for the explanation.
I ran a lucene/benchmark alg comparing the two stemmers on trunk, on my MacBook
Pro with Oracle Java 1.7.0_13, and it looks like the situation hasn't changed
much.
Measured as stemming-only cost (elapsed time over the no-stemmer baseline), the
original-algorithm Porter stemmer is roughly 4X faster than the Porter2/English
Snowball stemmer, which works out to about 40% higher throughput for the full
English analysis pipeline.
So IMO the default English stemmer choice is still the right one.
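To make the comparison concrete: the two chains differ only in the final
stemming filter.  In Java they'd look roughly like this (just a sketch against
the trunk/4.x analysis APIs - the class name, method name and Version constant
are placeholders; the benchmark itself builds the chains via AnalyzerFactory):
-----
// Sketch: the shared English chain with either stemmer bolted on at the end.
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StemmerComparison {
  static final Version V = Version.LUCENE_42; // adjust to your build

  /** useSnowball=false matches EnglishAnalyzer's current default chain. */
  static Analyzer englishChain(final boolean useSnowball) {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(V, reader);
        TokenStream tok = new StandardFilter(V, source);
        tok = new EnglishPossessiveFilter(V, tok);
        tok = new LowerCaseFilter(V, tok);
        tok = new StopFilter(V, tok, EnglishAnalyzer.getDefaultStopSet());
        tok = useSnowball
            ? new SnowballFilter(tok, "English")  // Porter2/English Snowball
            : new PorterStemFilter(tok);          // original Porter algorithm
        return new TokenStreamComponents(source, tok);
      }
    };
  }
}
-----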
Here's porter-comparison.alg:
-----
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
doc.tokenized=false
doc.body.tokenized=true
docs.dir=reuters-out
-AnalyzerFactory(name:original-porter-stemmer,StandardTokenizer,
StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
PorterStemFilter)
-AnalyzerFactory(name:porter2-stemmer,StandardTokenizer,
StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
SnowballPorterFilter(language:English))
-AnalyzerFactory(name:no-stemmer,StandardTokenizer,
StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter)
{ "Rounds"
-NewAnalyzer(original-porter-stemmer)
-ResetInputs
{ "Original Porter Stemmer" { ReadTokens > : 20000 }
-NewAnalyzer(porter2-stemmer)
-ResetInputs
{ "Porter2/English Stemmer" { ReadTokens > : 20000 }
-NewAnalyzer(no-stemmer)
-ResetInputs
{ "No Stemmer" { ReadTokens > : 20000 }
NewRound
} : 5
RepSumByNameRound
-----
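(FWIW, I ran this from lucene/benchmark with something like
"ant run-task -Dtask.alg=conf/porter-comparison.alg", with the extracted
Reuters corpus sitting wherever docs.dir points.)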
And the results (regrouped by analyzer, then ordered by elapsedSec within each
group) - a "rec" is a token:
-----
Operation                  round  recsPerRun         rec/s  elapsedSec
No Stemmer                     2     1814029  1,234,873.38        1.47
No Stemmer                     4     1814029  1,234,873.38        1.47
No Stemmer                     1     1814029  1,230,684.50        1.47
No Stemmer                     0     1814029  1,227,353.88        1.48
No Stemmer                     3     1814029  1,226,524.00        1.48
Original Porter Stemmer        1     1814029  1,074,025.50        1.69
Original Porter Stemmer        4     1814029  1,065,196.12        1.70
Original Porter Stemmer        2     1814029  1,056,510.75        1.72
Original Porter Stemmer        3     1814029  1,030,698.31        1.76
Original Porter Stemmer        0     1814029    685,833.25        2.64
Porter2/English Stemmer        4     1814029    768,656.38        2.36
Porter2/English Stemmer        2     1814029    764,123.44        2.37
Porter2/English Stemmer        1     1814029    758,056.44        2.39
Porter2/English Stemmer        3     1814029    758,056.44        2.39
Porter2/English Stemmer        0     1814029    716,158.31        2.53
-----
Best of 5 rounds, with stemming-only cost = (best elapsed) - (no-stemmer baseline):
No Stemmer (baseline): 1.47s
Original Porter Stemmer: 1.69s - 1.47s = 0.22s stemming cost
Porter2/English Stemmer: 2.36s - 1.47s = 0.89s stemming cost
Stemming-only speedup: 0.89s / 0.22s ~= 4X
Full-pipeline throughput increase: (2.36s - 1.69s) / 1.69s * 100 ~= 40%
Steve