Mark Harwood created LUCENE-8352:
------------------------------------

             Summary: Make TokenStreamComponents final
                 Key: LUCENE-8352
                 URL: https://issues.apache.org/jira/browse/LUCENE-8352
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Mark Harwood


The current design is a little trappy. Any specialised subclasses of 
TokenStreamComponents _(see StandardAnalyzer, ClassicAnalyzer, 
UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap 
them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, 
ShingleAnalyzerWrapper and other examples in elasticsearch)_. 

The current design means each AnalyzerWrapper.wrapComponents() implementation 
discards any custom TokenStreamComponents and replaces it with one of its own 
choosing (a vanilla TokenStreamComponents class from examples I've seen).

This is a trap I fell into when writing a custom TokenStreamComponents with a 
custom setReader() and I wondered why it was not being triggered when wrapped 
by other analyzers.
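
A minimal sketch of the trap follows. It uses simplified stand-in classes 
rather than the real Lucene API: Components, CustomComponents and the static 
wrapComponents() method are hypothetical names that only model the pattern 
described above, where a wrapper builds a vanilla components object and the 
overridden setReader() is therefore never invoked.

```java
import java.io.Reader;
import java.io.StringReader;

public class LossyWrapDemo {
    // Simplified stand-in for TokenStreamComponents (assumption: not the real class).
    static class Components {
        final StringBuilder log;
        Components(StringBuilder log) { this.log = log; }
        void setReader(Reader reader) { log.append("base;"); }
    }

    // A custom subclass with specialised setReader() behaviour.
    static class CustomComponents extends Components {
        CustomComponents(StringBuilder log) { super(log); }
        @Override
        void setReader(Reader reader) { log.append("custom;"); }
    }

    // Models a wrapComponents() implementation that returns a vanilla
    // Components instance, dropping the subclass of the wrapped one.
    static Components wrapComponents(Components inner) {
        return new Components(inner.log);
    }

    // Wraps a CustomComponents and records which setReader() actually ran.
    static String run() {
        StringBuilder log = new StringBuilder();
        Components wrapped = wrapComponents(new CustomComponents(log));
        wrapped.setReader(new StringReader("text"));
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model only the base setReader() is recorded; the custom override is 
silently lost once the components are wrapped.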

If AnalyzerWrapper is designed to encourage composition, it's arguably a mistake 
to also permit custom TokenStreamComponents subclasses: the composition 
process does not preserve the choice of custom classes or any behaviours they 
might add. For this reason we should not encourage extensions to 
TokenStreamComponents (or, if TSC extensions are required, we should somehow mark 
an Analyzer as "unwrappable" to prevent lossy compositions).
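
One possible shape for a composition-friendly design is sketched below. This 
is only an illustration under stated assumptions, not a proposed patch: the 
final Components class, its Consumer<Reader> source and the String sink label 
are all hypothetical stand-ins. The idea is that custom reader handling is 
passed in as a value rather than added by subclassing, so a wrapper can carry 
it forward.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

public class FinalComponentsSketch {
    // Hypothetical final stand-in: behaviour is injected, not subclassed.
    static final class Components {
        private final Consumer<Reader> source; // custom reader handling
        private final String sink;             // label standing in for the token stream chain

        Components(Consumer<Reader> source, String sink) {
            this.source = source;
            this.sink = sink;
        }

        void setReader(Reader reader) { source.accept(reader); }
        Consumer<Reader> getSource() { return source; }
        String getSink() { return sink; }
    }

    // Models a wrapper that decorates the sink but reuses the inner source,
    // so the custom reader handling survives composition.
    static Components wrapComponents(Components inner) {
        return new Components(inner.getSource(), "limit(" + inner.getSink() + ")");
    }

    // Wraps components with a custom source and records what ran.
    static String run() {
        StringBuilder log = new StringBuilder();
        Components custom = new Components(r -> log.append("custom;"), "tokenizer");
        Components wrapped = wrapComponents(custom);
        wrapped.setReader(new StringReader("text"));
        return log + wrapped.getSink();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the class is final, there is no subclass for a wrapper to discard; the 
only extension point is the injected source, which the wrapper passes through.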

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
