[
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516745
]
Michael McCandless commented on LUCENE-966:
-------------------------------------------
I took the patch from here (to use jflex for StandardAnalyzer) and
merged it with the patch from LUCENE-969 (re-use Token & TokenStream)
to measure the net performance gains.
I measured the time to just tokenize all of Wikipedia with
StandardAnalyzer, using contrib/benchmark plus the patch from LUCENE-967
(test details are described in LUCENE-969).
With the jflex patch alone it takes 646 sec (best of 2 runs); when I
then merge in the patch from LUCENE-969 it takes 455 sec. Subtracting
the time to just load all Wikipedia docs (112 sec) from both numbers
gives a net additional speedup of 36% (534 sec -> 343 sec) from using
LUCENE-969 in addition to jflex.
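For anyone checking the arithmetic, here is a minimal sketch of how the
net figure is derived: subtract the fixed load time from both runs, then
take the relative reduction. The class and method names are made up for
illustration and are not part of contrib/benchmark; only the three
timings come from the runs above. The result rounds to the 36% quoted.

```java
// Hypothetical helper, not part of Lucene or contrib/benchmark.
class NetSpeedup {
    /**
     * Percentage of time saved once a fixed overhead (here: loading the
     * docs) has been subtracted from both the "before" and "after" runs.
     */
    public static double netReductionPercent(double beforeSec, double afterSec, double overheadSec) {
        double netBefore = beforeSec - overheadSec; // 646 - 112 = 534
        double netAfter = afterSec - overheadSec;   // 455 - 112 = 343
        return 100.0 * (netBefore - netAfter) / netBefore;
    }

    public static void main(String[] args) {
        // Timings from the Wikipedia runs: jflex only, jflex + LUCENE-969, load only.
        double pct = netReductionPercent(646, 455, 112);
        System.out.println("net reduction in tokenize time: " + pct + "%");
    }
}
```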
A couple of other things I noticed:
* The init cost of jflex (StandardTokenizerImpl) seems to be fairly
high: when I repeat the above test with smallish docs (100 tokens
each) instead, the gain from adding LUCENE-969 is around 84%. I think
this makes the new reusableTokenStream() in LUCENE-969 important to
commit.
* I'm seeing differing token counts with the jflex StandardAnalyzer
vs the current one, so the two apparently tokenize some input
differently. I will track down which tokens differ and post back...
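To illustrate why reuse matters for small docs (first bullet above),
here is a rough, self-contained sketch of the pattern: pay the setup
cost once, then re-point the same tokenizer at each new document. This
is NOT the Lucene API or the LUCENE-969 patch; SketchTokenizer, its
reset() method and the construction counter are hypothetical stand-ins,
and the whitespace splitting is only a placeholder for real scanning.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a scanner with a costly constructor; not Lucene code.
class SketchTokenizer {
    static int constructions = 0;                      // how often setup cost was paid
    private final char[] buffer = new char[16 * 1024]; // stands in for expensive setup
    private Reader input;

    SketchTokenizer(Reader input) {
        constructions++;
        this.input = input;
    }

    /** Re-point the existing instance at a new document instead of building a new one. */
    void reset(Reader input) {
        this.input = input;
    }

    /** Counts whitespace-separated tokens in the current input. */
    int countTokens() {
        int count = 0;
        boolean inToken = false;
        try {
            int n;
            while ((n = input.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    boolean ws = Character.isWhitespace(buffer[i]);
                    if (!ws && !inToken) count++; // a new token starts here
                    inToken = !ws;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}

class ReuseSketch {
    /** Tokenizes every doc with one shared tokenizer; returns the total token count. */
    public static int tokenizeAllReusing(String[] docs) {
        SketchTokenizer tokenizer = null;
        int total = 0;
        for (String doc : docs) {
            Reader reader = new StringReader(doc);
            if (tokenizer == null) {
                tokenizer = new SketchTokenizer(reader); // setup cost paid once
            } else {
                tokenizer.reset(reader);                 // cheap per-document step
            }
            total += tokenizer.countTokens();
        }
        return total;
    }

    public static void main(String[] args) {
        int total = tokenizeAllReusing(new String[] {"a b c", "d e", "f"});
        System.out.println(total + " tokens, " + SketchTokenizer.constructions + " construction(s)");
    }
}
```

With many 100-token docs, the per-document constructor call is a large
share of the work, which would be consistent with the bigger gain seen
in the small-doc test.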
> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt,
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several
> times) replacement for StandardAnalyzer. Will add a patch and a simple
> benchmark code in a while.