Mark Harwood created LUCENE-8352:
------------------------------------
Summary: Make TokenStreamComponents final
Key: LUCENE-8352
URL: https://issues.apache.org/jira/browse/LUCENE-8352
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Reporter: Mark Harwood
The current design is a little trappy. Any specialised subclasses of
TokenStreamComponents (see StandardAnalyzer, ClassicAnalyzer,
UAX29URLEmailAnalyzer) are discarded by any subsequent Analyzers that wrap
them (see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer,
ShingleAnalyzerWrapper and other examples in elasticsearch).
The current design means each AnalyzerWrapper.wrapComponents() implementation
discards any custom TokenStreamComponents and replaces it with one of its own
choosing (a vanilla TokenStreamComponents class from examples I've seen).
This is a trap I fell into when writing a custom TokenStreamComponents with a
custom setReader() and I wondered why it was not being triggered when wrapped
by other analyzers.
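
To make the trap concrete, here is a minimal sketch of the pattern. It
deliberately does not use Lucene: Components, CustomComponents, CustomAnalyzer
and LimitingWrapper are hypothetical, simplified stand-ins for
TokenStreamComponents, a custom subclass, an Analyzer and an AnalyzerWrapper.
It only illustrates how the custom setReader() gets lost, assuming the wrapper
rebuilds a vanilla components object as described above.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class WrapperTrapSketch {

    // Hypothetical stand-in for TokenStreamComponents (not the Lucene class).
    static class Components {
        final String source;

        Components(String source) {
            this.source = source;
        }

        void setReader(Reader reader, List<String> log) {
            log.add("base setReader");
        }
    }

    // A specialised subclass that adds its own setReader() behaviour.
    static class CustomComponents extends Components {
        CustomComponents(String source) {
            super(source);
        }

        @Override
        void setReader(Reader reader, List<String> log) {
            log.add("custom setReader");
            super.setReader(reader, log);
        }
    }

    // Hypothetical stand-in for Analyzer.
    interface Analyzer {
        Components createComponents();
    }

    static class CustomAnalyzer implements Analyzer {
        @Override
        public Components createComponents() {
            return new CustomComponents("tokenizer");
        }
    }

    // Hypothetical stand-in for an AnalyzerWrapper: it copies the parts it
    // needs into a vanilla Components, so the subclass is silently dropped.
    static class LimitingWrapper implements Analyzer {
        private final Analyzer delegate;

        LimitingWrapper(Analyzer delegate) {
            this.delegate = delegate;
        }

        @Override
        public Components createComponents() {
            Components inner = delegate.createComponents();
            return new Components(inner.source + "+limit");
        }
    }

    // Records which setReader() implementations run for a given analyzer.
    static List<String> run(Analyzer analyzer) {
        List<String> log = new ArrayList<>();
        analyzer.createComponents().setReader(new StringReader("text"), log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(new CustomAnalyzer()));                      // [custom setReader, base setReader]
        System.out.println(run(new LimitingWrapper(new CustomAnalyzer()))); // [base setReader]
    }
}
```

Once wrapped, the custom setReader() never runs, and nothing in the
types or at runtime warns about it.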
If AnalyzerWrapper is designed to encourage composition, it's arguably a
mistake to also permit custom TokenStreamComponents subclasses: the composition
process does not preserve the choice of custom class or any behaviour it
might add. For this reason we should not encourage extensions to
TokenStreamComponents (or, if TSC extensions are required, we should somehow
mark an Analyzer as "unwrappable" to prevent lossy compositions).
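
One possible shape for a composition-friendly design, sketched under stated
assumptions: if the components class is final, any custom reader handling has
to be passed in as a value, and a wrapper can then hand that value on instead
of losing it. The Components class, its Consumer<Reader> field and the wrap()
helper below are hypothetical illustrations, not Lucene's API.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class FinalComponentsSketch {

    // Hypothetical final stand-in: it cannot be subclassed, so custom reader
    // handling is supplied as a Consumer that wrappers can pass along.
    static final class Components {
        private final Consumer<Reader> source;
        private final String sink;

        Components(Consumer<Reader> source, String sink) {
            this.source = source;
            this.sink = sink;
        }

        Consumer<Reader> getSource() {
            return source;
        }

        String getSink() {
            return sink;
        }

        void setReader(Reader reader) {
            source.accept(reader);
        }
    }

    // Hypothetical wrapper step: changes the sink but reuses the inner source,
    // so the custom reader handling survives the composition.
    static Components wrap(Components inner) {
        return new Components(inner.getSource(), inner.getSink() + "+limit");
    }

    // Builds custom components, wraps them, and records what setReader() runs.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Components custom = new Components(r -> log.add("custom reader handling"), "tokens");
        wrap(custom).setReader(new StringReader("text"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [custom reader handling]
    }
}
```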