On Wed, Dec 07, 2011 at 10:42:57PM +0100, Nick Wellnhofer wrote:
> Some quick and completely unscientific benchmarks, indexing 1000 times
> the same 10K ASCII document:
>
> RT = RegexTokenizer
> ST = StandardTokenizer
> CF = CaseFolder
> N = Normalizer
>
> RT: 2.177s
> RT+CF: 3.964s
> RT+N: 2.556s
> ST: 1.551s
> ST+CF: 3.357s
> ST+N: 1.931s

These numbers are great, and in line with some benchmarks I was also running
today (raw data below). StandardTokenizer and Normalizer are considerably
faster than RegexTokenizer and the current implementation of CaseFolder, and
thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
SnowballStemmer) by a wide margin, taking about 24% less time:

    Time to index 1000 docs (10 reps, truncated mean)
    =================================================
    PolyAnalyzer   0.576 secs
    EasyAnalyzer   0.436 secs

Can't wait for StandardTokenizer to land in trunk!
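
For anyone who wants to check the summary figures against the raw data below,
here is a minimal sketch in Python. It assumes the truncated mean drops the two
fastest and two slowest of the ten runs (which matches "6 kept, 4 discarded"
and reproduces the reported values); the function name `truncated_mean` and
its `discard_each_end` parameter are mine, not part of the benchmark script.

```python
def truncated_mean(times, discard_each_end=2):
    """Mean after dropping the N fastest and N slowest runs (assumed scheme)."""
    if len(times) <= 2 * discard_each_end:
        raise ValueError("not enough samples to truncate")
    kept = sorted(times)[discard_each_end:len(times) - discard_each_end]
    return sum(kept) / len(kept)

# Per-rep timings copied from the PolyAnalyzer and EasyAnalyzer runs below.
poly = [0.577, 0.577, 0.579, 0.576, 0.576, 0.575, 0.576, 0.575, 0.586, 0.575]
easy = [0.437, 0.434, 0.436, 0.437, 0.436, 0.436, 0.441, 0.436, 0.435, 0.435]

poly_tm = truncated_mean(poly)
easy_tm = truncated_mean(easy)
saving = 1 - easy_tm / poly_tm  # fraction of indexing time saved

print(f"PolyAnalyzer {poly_tm:.3f} secs")
print(f"EasyAnalyzer {easy_tm:.3f} secs")
print(f"EasyAnalyzer takes {saving:.0%} less time")
```

With these inputs the sketch prints 0.576 and 0.436 secs, and a saving of 24%.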
> It's also interesting that moving the tokenizer in front of the case
> folder or normalizer always gave me faster results.
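
Yes, I get the same results: with the stemmer in place, StandardTokenizer
followed by Normalizer took 0.436 secs versus 0.471 secs the other way
around, and RegexTokenizer followed by CaseFolder took 0.556 secs versus
0.576 secs for PolyAnalyzer's CaseFolder-first order. When I first saw the
effect, I thought it might be a stack-memory-vs-malloc'd-buffer issue in
Normalizer, but I was surprised to see CaseFolder behave the same way. I have
no explanation, but the results certainly argue for starting off analysis
with tokenization.
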
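
To make the ordering point concrete, here is a toy sketch in Python. It is not
Lucy code: `tokenize` and `fold` are hypothetical stand-ins for a
whitespace-splitting tokenizer (the same \S+ pattern used in the runs below)
and a case folder. For plain ASCII input like this, both orderings yield the
same tokens, so which analyzer goes first is purely a question of cost. That
may not hold for every tokenizer or for Unicode normalization in general.

```python
import re

def tokenize(text):
    # Stand-in tokenizer: each run of non-whitespace is one token.
    return re.findall(r"\S+", text)

def fold(text):
    # Stand-in case folder: simple lowercasing.
    return text.lower()

def tokenize_then_fold(text):
    # Tokenizer first: the folder sees many short strings.
    return [fold(tok) for tok in tokenize(text)]

def fold_then_tokenize(text):
    # Folder first: the folder sees one long string.
    return tokenize(fold(text))

doc = "The Quick Brown Fox Jumps Over The Lazy Dog"
tokens_a = tokenize_then_fold(doc)
tokens_b = fold_then_tokenize(doc)
print(tokens_a == tokens_b)
print(tokens_a)
```

The sketch says nothing about why tokenizing first is faster in Lucy; it only
shows that, for simple input, reordering the chain need not change the output.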
Marvin Humphrey
===========================================================================
~/projects/lucy_196/perl $ # RegexTokenizer, pattern => \S+
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.300 Docs: 1000
2 Secs: 0.299 Docs: 1000
3 Secs: 0.297 Docs: 1000
4 Secs: 0.300 Docs: 1000
5 Secs: 0.298 Docs: 1000
6 Secs: 0.299 Docs: 1000
7 Secs: 0.297 Docs: 1000
8 Secs: 0.296 Docs: 1000
9 Secs: 0.300 Docs: 1000
10 Secs: 0.298 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.298 secs
Truncated mean (6 kept, 4 discarded): 0.298 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # StandardTokenizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.254 Docs: 1000
2 Secs: 0.251 Docs: 1000
3 Secs: 0.253 Docs: 1000
4 Secs: 0.251 Docs: 1000
5 Secs: 0.253 Docs: 1000
6 Secs: 0.252 Docs: 1000
7 Secs: 0.253 Docs: 1000
8 Secs: 0.253 Docs: 1000
9 Secs: 0.251 Docs: 1000
10 Secs: 0.254 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.253 secs
Truncated mean (6 kept, 4 discarded): 0.253 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # CaseFolder
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.160 Docs: 1000
2 Secs: 0.159 Docs: 1000
3 Secs: 0.160 Docs: 1000
4 Secs: 0.159 Docs: 1000
5 Secs: 0.160 Docs: 1000
6 Secs: 0.158 Docs: 1000
7 Secs: 0.161 Docs: 1000
8 Secs: 0.158 Docs: 1000
9 Secs: 0.160 Docs: 1000
10 Secs: 0.158 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.159 secs
Truncated mean (6 kept, 4 discarded): 0.159 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # Normalizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.150 Docs: 1000
2 Secs: 0.148 Docs: 1000
3 Secs: 0.150 Docs: 1000
4 Secs: 0.149 Docs: 1000
5 Secs: 0.150 Docs: 1000
6 Secs: 0.148 Docs: 1000
7 Secs: 0.150 Docs: 1000
8 Secs: 0.148 Docs: 1000
9 Secs: 0.151 Docs: 1000
10 Secs: 0.148 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.149 secs
Truncated mean (6 kept, 4 discarded): 0.149 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # PolyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.577 Docs: 1000
2 Secs: 0.577 Docs: 1000
3 Secs: 0.579 Docs: 1000
4 Secs: 0.576 Docs: 1000
5 Secs: 0.576 Docs: 1000
6 Secs: 0.575 Docs: 1000
7 Secs: 0.576 Docs: 1000
8 Secs: 0.575 Docs: 1000
9 Secs: 0.586 Docs: 1000
10 Secs: 0.575 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.577 secs
Truncated mean (6 kept, 4 discarded): 0.576 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # EasyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.437 Docs: 1000
2 Secs: 0.434 Docs: 1000
3 Secs: 0.436 Docs: 1000
4 Secs: 0.437 Docs: 1000
5 Secs: 0.436 Docs: 1000
6 Secs: 0.436 Docs: 1000
7 Secs: 0.441 Docs: 1000
8 Secs: 0.436 Docs: 1000
9 Secs: 0.435 Docs: 1000
10 Secs: 0.435 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.436 secs
Truncated mean (6 kept, 4 discarded): 0.436 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # [ Normalizer, StandardTokenizer, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.470 Docs: 1000
2 Secs: 0.471 Docs: 1000
3 Secs: 0.472 Docs: 1000
4 Secs: 0.472 Docs: 1000
5 Secs: 0.477 Docs: 1000
6 Secs: 0.470 Docs: 1000
7 Secs: 0.468 Docs: 1000
8 Secs: 0.470 Docs: 1000
9 Secs: 0.471 Docs: 1000
10 Secs: 0.470 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.471 secs
Truncated mean (6 kept, 4 discarded): 0.471 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # [ RegexTokenizer, CaseFolder, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.555 Docs: 1000
2 Secs: 0.558 Docs: 1000
3 Secs: 0.557 Docs: 1000
4 Secs: 0.555 Docs: 1000
5 Secs: 0.565 Docs: 1000
6 Secs: 0.556 Docs: 1000
7 Secs: 0.555 Docs: 1000
8 Secs: 0.558 Docs: 1000
9 Secs: 0.555 Docs: 1000
10 Secs: 0.553 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.557 secs
Truncated mean (6 kept, 4 discarded): 0.556 secs
------------------------------------------------------------
~/projects/lucy_196/perl $