[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516775 ]
Michael McCandless commented on LUCENE-966:
-------------------------------------------
I agree, let's try to perfectly match the tokens of the old
StandardAnalyzer so we have a way-faster drop-in replacement.
The speedups of JFlex are amazing: based on a quick test, with JFlex +
patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower
than WhitespaceAnalyzer even though it's doing so much more ...
> Finally, when it comes to the initialization time of the new
> tokenizer -- according to the JFlex documentation, some time is
> required to unpack the transition tables. But the unpacking takes
> place during the initialization of static fields, so once the class
> is loaded the overhead should be negligible.
Yeah, I'm baffled why it's that much slower, but on 100-token docs I
definitely see LUCENE-969 making things 84% faster, but "only" 36%
faster if I use the full Wikipedia docs (which are much larger than 100
tokens on average). If we tested even smaller docs, I think the gains
would be even larger.
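
One way to read the numbers above is as a fixed per-document setup cost that
matters less as documents grow. The sketch below is only a toy model under
that assumption: the class name, the method, and the cost units are all made
up for illustration and are not measurements from the benchmark.

```java
/** Toy cost model: a fixed per-document setup cost plus a per-token cost (arbitrary units). */
final class SetupCostModel {

    /**
     * Ratio of time without reuse to time with reuse for a document of
     * the given size, assuming reuse removes the setup cost entirely.
     */
    static double speedup(double setupCost, double perTokenCost, int tokens) {
        double withReuse = perTokenCost * tokens;
        double withoutReuse = setupCost + withReuse;
        return withoutReuse / withReuse;
    }

    public static void main(String[] args) {
        double setup = 50.0;    // hypothetical units, not measured
        double perToken = 1.0;  // hypothetical units, not measured
        for (int n : new int[] {10, 100, 1000, 10000}) {
            System.out.printf("%d tokens: %.3fx%n", n, speedup(setup, perToken, n));
        }
    }
}
```

In this model the ratio approaches 1 as the token count grows, which would be
consistent with smaller gains on the full Wikipedia docs and larger gains on
docs shorter than 100 tokens.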
When I ran under the profiler, the StandardTokenizerImpl
<init>(java.io.Reader) constructor was way at the top. Maybe it's the
cost of new'ing the 16 KB buffer each time?
In any event I think it's OK, so long as we get LUCENE-969 in, and
document the importance of using the reusableTokenStream() API for
better performance.
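
For the documentation, the reuse pattern could be illustrated roughly as
follows. This is a self-contained sketch, not Lucene code: SketchTokenizer,
SketchAnalyzer and ReuseDemo are hypothetical stand-ins, the 16 KB buffer size
is just the guess from the profiler run above, and the real
reusableTokenStream() has a different signature (it also takes a field name).
The point is only that the per-thread tokenizer keeps its buffer and is reset
to a new Reader instead of being constructed again for each document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a generated scanner; not Lucene's StandardTokenizerImpl. */
final class SketchTokenizer {
    static final int BUFFER_SIZE = 16 * 1024; // assumed size, taken from the "16 KB" guess

    private final char[] buffer = new char[BUFFER_SIZE]; // allocated once per instance
    private Reader input;
    private int pos;   // next unread index in buffer
    private int limit; // number of valid chars currently in buffer

    SketchTokenizer(Reader input) {
        reset(input);
    }

    /** Points the tokenizer at a new document while keeping the existing buffer. */
    void reset(Reader input) {
        this.input = input;
        this.pos = 0;
        this.limit = 0;
    }

    /** Returns the next whitespace-delimited token, or null at end of input. */
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            if (pos == limit) {
                int n = input.read(buffer, 0, buffer.length);
                if (n <= 0) {
                    break; // end of input
                }
                pos = 0;
                limit = n;
            }
            char c = buffer[pos++];
            if (Character.isWhitespace(c)) {
                if (sb.length() > 0) {
                    return sb.toString();
                }
            } else {
                sb.append(c);
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}

/** Hypothetical analyzer showing a fresh stream versus a per-thread reused one. */
final class SketchAnalyzer {
    private final ThreadLocal<SketchTokenizer> cached = new ThreadLocal<>();

    /** Allocates a new tokenizer, and therefore a new buffer, for every document. */
    SketchTokenizer tokenStream(Reader reader) {
        return new SketchTokenizer(reader);
    }

    /** Reuses this thread's tokenizer and only swaps its input. */
    SketchTokenizer reusableTokenStream(Reader reader) {
        SketchTokenizer t = cached.get();
        if (t == null) {
            t = new SketchTokenizer(reader);
            cached.set(t);
        } else {
            t.reset(reader);
        }
        return t;
    }
}

final class ReuseDemo {
    /** Drains a tokenizer into a list. */
    static List<String> tokens(SketchTokenizer t) throws IOException {
        List<String> out = new ArrayList<>();
        for (String tok = t.next(); tok != null; tok = t.next()) {
            out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        SketchAnalyzer analyzer = new SketchAnalyzer();
        SketchTokenizer first = analyzer.reusableTokenStream(new StringReader("a faster analyzer"));
        System.out.println(tokens(first));
        SketchTokenizer second = analyzer.reusableTokenStream(new StringReader("drop-in replacement"));
        System.out.println(tokens(second));
        System.out.println("same instance reused: " + (first == second));
    }
}
```

A caller must finish consuming one document before asking for the next
stream on the same thread, since the returned instance is shared; that
caveat is probably worth stating in the docs as well.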
> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt,
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several
> times) replacement for StandardAnalyzer. Will add a patch and a simple
> benchmark code in a while.