[
https://issues.apache.org/jira/browse/LUCENE-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743183#action_12743183
]
Shai Erera commented on LUCENE-1794:
------------------------------------
We only need getTokenizer because TokenStream.reset() does not accept a Reader.
If we could introduce such a method on TokenStream, we wouldn't need to refer to
Tokenizer directly.
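To make the point concrete, here is a minimal sketch of why the holder has to
expose the Tokenizer. All classes below (SketchTokenStream, SketchTokenizer,
SketchStreams, and so on) are simplified, hypothetical stand-ins and not the
real Lucene classes; the only property carried over from the discussion is that
reset() on the stream takes no Reader, while the tokenizer has a reset(Reader).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical stand-in for TokenStream: reset() takes no Reader.
abstract class SketchTokenStream {
    abstract String next();   // returns null at end of stream
    void reset() {}
}

// Hypothetical stand-in for Tokenizer: the only type that can be pointed at a new Reader.
abstract class SketchTokenizer extends SketchTokenStream {
    protected Reader input;
    SketchTokenizer(Reader input) { this.input = input; }
    void reset(Reader input) { this.input = input; }
}

class WhitespaceSketchTokenizer extends SketchTokenizer {
    WhitespaceSketchTokenizer(Reader input) { super(input); }

    @Override
    String next() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) break;   // end of the current token
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}

class LowerCaseSketchFilter extends SketchTokenStream {
    private final SketchTokenStream in;
    LowerCaseSketchFilter(SketchTokenStream in) { this.in = in; }

    @Override
    String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase(Locale.ROOT);
    }
}

// Holder for a reusable chain: the source tokenizer plus the outermost stream.
class SketchStreams {
    private final SketchTokenizer tokenizer;
    private final SketchTokenStream result;

    SketchStreams(SketchTokenizer tokenizer, SketchTokenStream result) {
        this.tokenizer = tokenizer;
        this.result = result;
    }

    // Needed only because result.reset() cannot take the new Reader.
    SketchTokenizer getTokenizer() { return tokenizer; }
    SketchTokenStream getResult() { return result; }
}

class StreamsDemo {
    public static void main(String[] args) {
        SketchTokenizer tok = new WhitespaceSketchTokenizer(new StringReader("Foo Bar"));
        SketchStreams streams = new SketchStreams(tok, new LowerCaseSketchFilter(tok));
        for (String t = streams.getResult().next(); t != null; t = streams.getResult().next()) {
            System.out.println(t);
        }
        // Reuse: the Reader goes in through the tokenizer, then the chain is reset.
        streams.getTokenizer().reset(new StringReader("Baz"));
        streams.getResult().reset();
        System.out.println(streams.getResult().next());
    }
}
```

If the stream type itself had a reset that accepted a Reader, the holder could
keep a single reference and getTokenizer would not be needed.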
bq. do you have any ideas on the back compat issues?
Well it's a bit trickier ... today we call reusableTokenStream in our indexing
code, and get either a new instance or a reused instance. We cannot change
Analyzer's default behavior, which returns a new instance (unless we're willing
to break back-compat), because Analyzers that did not override
reusableTokenStream may break if we start reusing the instance by default (for
example, if they add two fields to a document and call reusableTokenStream once
per field, the second call would reset the stream handed out for the first
field).
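A small sketch of that failure mode, using hypothetical stand-in classes
(FieldStream, FreshAnalyzer, ReuseByDefaultAnalyzer) rather than the real
Lucene types. It assumes a caller that asks for two streams, one per field,
and holds on to both:

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for a token stream bound to one Reader.
class FieldStream {
    Reader input;
    FieldStream(Reader input) { this.input = input; }
    void reset(Reader input) { this.input = input; }
}

// Today's default: every call hands back a fresh instance.
class FreshAnalyzer {
    FieldStream reusableTokenStream(String field, Reader reader) {
        return new FieldStream(reader);
    }
}

// What a reuse-by-default base class would do instead.
// Simplification: the saved stream is a plain field here, not kept per thread.
class ReuseByDefaultAnalyzer {
    private FieldStream previous;

    FieldStream reusableTokenStream(String field, Reader reader) {
        if (previous == null) {
            previous = new FieldStream(reader);
        } else {
            previous.reset(reader);   // silently re-points the stream handed out earlier
        }
        return previous;
    }
}

class DefaultReuseDemo {
    public static void main(String[] args) {
        Reader title = new StringReader("some title");
        Reader body = new StringReader("some body");

        FreshAnalyzer fresh = new FreshAnalyzer();
        FieldStream t1 = fresh.reusableTokenStream("title", title);
        FieldStream b1 = fresh.reusableTokenStream("body", body);
        System.out.println("fresh: independent streams = " + (t1 != b1));

        ReuseByDefaultAnalyzer reusing = new ReuseByDefaultAnalyzer();
        FieldStream t2 = reusing.reusableTokenStream("title", title);
        FieldStream b2 = reusing.reusableTokenStream("body", body);
        // Same object twice: the "title" stream now reads the body Reader.
        System.out.println("reuse: same instance = " + (t2 == b2));
        System.out.println("reuse: title stream lost its Reader = " + (t2.input == body));
    }
}
```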
Also, deprecating reusableTokenStream, defining a new method (say
reuseTokenStream), and moving our code to use it is not good either: we want
its default impl to reuse the token stream, and Analyzers that do not override
it may break.
So how about we create a new abstract ReusingAnalyzer which impls
reusableTokenStream to always reuse the stream, and add Streams to Analyzer as
a protected static class? That way, Analyzers that don't care about reuse can
still extend Analyzer. Analyzers which care about reuse and are fine w/
ReusingAnalyzer's impl can move to extend it. And Analyzers that care about
reuse but want it done differently can extend either ReusingAnalyzer or
Analyzer.
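A rough sketch of what that split could look like. Everything here is
hypothetical and heavily simplified: MiniTokenizer, MiniAnalyzer and
MiniReusingAnalyzer stand in for Tokenizer, Analyzer and the proposed
ReusingAnalyzer, the Streams holder keeps only the tokenizer, and the saved
streams live in a plain field instead of being kept per thread.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for Tokenizer.
class MiniTokenizer {
    Reader input;
    MiniTokenizer(Reader input) { this.input = input; }
    void reset(Reader input) { this.input = input; }
}

// Hypothetical stand-in for Analyzer: the default behavior stays "new instance per call".
abstract class MiniAnalyzer {
    // The holder proposed above, nested as a protected static class.
    protected static class Streams {
        private final MiniTokenizer tokenizer;
        Streams(MiniTokenizer tokenizer) { this.tokenizer = tokenizer; }
        MiniTokenizer getTokenizer() { return tokenizer; }
    }

    private Streams previous;   // simplification: not per thread

    protected Streams getPreviousStreams() { return previous; }
    protected void setPreviousStreams(Streams streams) { this.previous = streams; }

    abstract MiniTokenizer tokenStream(String field, Reader reader);

    // Unchanged default, so subclasses that never overrode it behave as before.
    MiniTokenizer reusableTokenStream(String field, Reader reader) {
        return tokenStream(field, reader);
    }
}

// Hypothetical stand-in for the proposed ReusingAnalyzer: reuse is opt-in by extending it.
abstract class MiniReusingAnalyzer extends MiniAnalyzer {
    @Override
    MiniTokenizer reusableTokenStream(String field, Reader reader) {
        Streams streams = getPreviousStreams();
        if (streams == null) {
            streams = new Streams(tokenStream(field, reader));
            setPreviousStreams(streams);
        } else {
            streams.getTokenizer().reset(reader);
        }
        return streams.getTokenizer();
    }
}

// An analyzer that does not care about reuse keeps extending the base class.
class LegacyAnalyzer extends MiniAnalyzer {
    @Override
    MiniTokenizer tokenStream(String field, Reader reader) { return new MiniTokenizer(reader); }
}

// An analyzer that wants reuse only changes its superclass.
class OptInAnalyzer extends MiniReusingAnalyzer {
    @Override
    MiniTokenizer tokenStream(String field, Reader reader) { return new MiniTokenizer(reader); }
}

class ReusingAnalyzerDemo {
    public static void main(String[] args) {
        MiniAnalyzer legacy = new LegacyAnalyzer();
        MiniAnalyzer optIn = new OptInAnalyzer();
        System.out.println(legacy.reusableTokenStream("f", new StringReader("a"))
                != legacy.reusableTokenStream("f", new StringReader("b")));
        System.out.println(optIn.reusableTokenStream("f", new StringReader("a"))
                == optIn.reusableTokenStream("f", new StringReader("b")));
    }
}
```

Callers keep calling reusableTokenStream in both cases; only the analyzer's
choice of superclass decides whether an instance is reused.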
Back-compat wise, we're safe since:
# Existing Lucene Analyzers that reuse can be changed to extend ReusingAnalyzer.
# Existing Analyzers (outside Lucene code) either override reusableTokenStream
or don't; in both cases their behavior is unchanged, so they won't break.
# Our indexing code will still call reusableTokenStream, no change here.
# Any code out there which traverses an Analyzer by calling reusableTokenStream
does not need to change anything.
I think that'd work?
> implement reusableTokenStream for all contrib analyzers
> -------------------------------------------------------
>
> Key: LUCENE-1794
> URL: https://issues.apache.org/jira/browse/LUCENE-1794
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch,
> LUCENE-1794.patch, LUCENE-1794.patch
>
>
> most contrib analyzers do not have an impl for reusableTokenStream
> regardless of how expensive the back compat reflection is for indexing speed,
> I think we should do this to mitigate any performance costs. hey, overall it
> might even be an improvement!
> the back compat code for non-final analyzers is already in place so this is
> easy money in my opinion.