[ 
https://issues.apache.org/jira/browse/LUCENE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509635#comment-16509635
 ] 

Mark Harwood commented on LUCENE-8352:
--------------------------------------

My use case was a bit special. I had a custom Reader that [dealt with 
hyperlinked 
text|https://github.com/elastic/elasticsearch/issues/29467#issuecomment-385393246]
 by stripping out the hyperlink markup before feeding the remaining plain text 
into tokenisation. The tricky bit was that the extracted URLs were not thrown 
away but passed to a special TokenFilter at the end of the chain, which 
injected them at the appropriate positions in the token stream.

The workaround was a custom AnalyzerWrapper that overrode wrapReader (which, 
unlike a custom TokenStreamComponents, is still invoked when the analyzer is 
wrapped), plus some ThreadLocal hackery to connect my TokenFilter to the URLs 
extracted by the Reader.
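
The hand-off described above can be sketched roughly as follows. This is a 
plain-Java illustration only, not the actual code and not the Lucene API: the 
class LinkHandoffSketch, its methods and the [text](url) markup are all 
assumptions made for the example. For brevity the URLs are appended after the 
text tokens rather than injected at their original positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the workaround; none of these names come from Lucene.
class LinkHandoffSketch {
    // Assumed markup: [visible text](url). The real markup is not given in the comment.
    private static final Pattern LINK =
            Pattern.compile("\\[([^\\]]*)\\]\\(([^)]*)\\)");

    // Per-thread slot that carries the extracted URLs from the reader stage
    // to the filter stage, since the two stages share no direct reference.
    private static final ThreadLocal<List<String>> EXTRACTED =
            ThreadLocal.withInitial(ArrayList::new);

    // Plays the role of the reader-wrapping hook: strips the link markup,
    // keeps the visible text, and remembers the URLs for later.
    static String stripLinks(String input) {
        List<String> urls = EXTRACTED.get();
        urls.clear(); // start fresh for each document on this thread
        Matcher m = LINK.matcher(input);
        StringBuilder plain = new StringBuilder();
        int last = 0;
        while (m.find()) {
            plain.append(input, last, m.start()).append(m.group(1));
            urls.add(m.group(2));
            last = m.end();
        }
        plain.append(input, last, input.length());
        return plain.toString();
    }

    // Plays the role of the filter at the end of the chain: emits the text
    // tokens, then the URLs collected by stripLinks() on the same thread.
    static List<String> tokensWithUrls(String plainText) {
        List<String> out = new ArrayList<>();
        String trimmed = plainText.trim();
        if (!trimmed.isEmpty()) {
            for (String token : trimmed.split("\\s+")) {
                out.add(token);
            }
        }
        out.addAll(EXTRACTED.get());
        return out;
    }

    public static void main(String[] args) {
        String plain = stripLinks("see [the docs](http://example.com/docs) now");
        System.out.println(plain);
        System.out.println(tokensWithUrls(plain));
    }
}
```

The ThreadLocal only works if both stages run on the same thread for the same 
document, which is why this is best regarded as hackery rather than a design.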

I'm not sure how common this sort of analysis is, but before I reached this 
solution I took quite a detour trying to figure out why a custom 
TokenStreamComponents was not working when wrapped.

 

> Make TokenStreamComponents final
> --------------------------------
>
>                 Key: LUCENE-8352
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8352
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>
> The current design is a little trappy. Any specialised subclasses of 
> TokenStreamComponents _(see StandardAnalyzer, ClassicAnalyzer, 
> UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap 
> them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, 
> ShingleAnalyzerWrapper and other examples in elasticsearch)_. 
> The current design means each AnalyzerWrapper.wrapComponents() implementation 
> discards any custom TokenStreamComponents and replaces it with one of its own 
> choosing (a vanilla TokenStreamComponents class from examples I've seen).
> This is a trap I fell into when writing a custom TokenStreamComponents with a 
> custom setReader() and I wondered why it was not being triggered when wrapped 
> by other analyzers.
> If AnalyzerWrapper is designed to encourage composition, it's arguably a 
> mistake to also permit custom TokenStreamComponents subclasses: the 
> composition process does not preserve the choice of custom classes and any 
> behaviours they might add. For this reason we should not encourage extensions 
> to TokenStreamComponents (or if TSC extensions are required we should somehow 
> mark an Analyzer as "unwrappable" to prevent lossy compositions).
>  
>  
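
The trap described in the issue can be reproduced with a small stand-alone 
model. The classes below are illustrative stand-ins written for this example, 
not the Lucene classes: Components plays the part of a components holder, 
CountingComponents is a subclass with a custom setReader(), and 
LimitingWrapper mimics a wrapper that builds a fresh vanilla holder from the 
inner one's parts.

```java
import java.io.Reader;
import java.io.StringReader;

// Stand-in for a components holder; NOT the Lucene class.
class Components {
    final String source; // stands in for the tokenizer
    final String sink;   // stands in for the end of the filter chain
    Reader lastReader;

    Components(String source, String sink) {
        this.source = source;
        this.sink = sink;
    }

    void setReader(Reader reader) {
        this.lastReader = reader;
    }
}

// A specialised subclass that adds behaviour to setReader().
class CountingComponents extends Components {
    int setReaderCalls = 0;

    CountingComponents(String source, String sink) {
        super(source, sink);
    }

    @Override
    void setReader(Reader reader) {
        setReaderCalls++; // the custom behaviour that gets lost
        super.setReader(reader);
    }
}

// A wrapper that copies the inner parts into a plain Components instance,
// silently dropping the subclass and its setReader() override.
class LimitingWrapper {
    Components wrap(Components inner) {
        return new Components(inner.source, inner.sink + "->limit");
    }
}

class WrapperTrapDemo {
    // Returns how many times the custom setReader() ran after wrapping.
    static int callsSeenByCustomHolder() {
        CountingComponents custom = new CountingComponents("tokenizer", "lowercase");
        Components wrapped = new LimitingWrapper().wrap(custom);
        wrapped.setReader(new StringReader("some text"));
        return custom.setReaderCalls;
    }

    public static void main(String[] args) {
        System.out.println("custom setReader calls: " + callsSeenByCustomHolder());
    }
}
```

In this model the custom setReader() is never called once the holder has been 
wrapped, because the wrapper only reuses the inner holder's parts and not the 
holder itself.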


