[
https://issues.apache.org/jira/browse/LUCENE-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455008#comment-15455008
]
Adrien Grand commented on LUCENE-7429:
--------------------------------------
bq. The issue here is mostly that we need to create a new TokenStream
(StringTokenStream) for the normalization and we need to use the same attribute
types.
Exactly. For instance, if a term attribute produces UTF-16 encoded tokens, the
normalization chain needs to read them through the same attribute type,
otherwise the produced bytes would be interpreted inconsistently.
bq. Although this is sometimes broken for use cases where TokenStreams create
binary tokens. But those would never be normalized, I think (!?)
Do you mean that you cannot think of any use case for using both a non-default
term attribute and token filters in the same analysis chain? I am wondering
about CJK analyzers: I think UTF-16 stores CJK characters a bit more
efficiently on average than UTF-8 (I may be completely wrong, in which case
please let me know), so users might be tempted to use a different term
attribute impl?
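For reference, the encoded-size difference speculated about above can be checked
with plain Java. This sketch uses an illustrative string of my own choosing
(UTF_16BE is used rather than UTF_16 to avoid counting the byte-order mark):

```java
import java.nio.charset.StandardCharsets;

// BMP CJK characters take 3 bytes each in UTF-8 but only 2 in UTF-16,
// while ASCII is the other way around (1 byte vs 2).
public class CjkEncodingSize {
    public static void main(String[] args) {
        String cjk = "日本語";  // 3 CJK characters
        String ascii = "abc";   // 3 ASCII characters
        System.out.println("CJK   UTF-8:  " + cjk.getBytes(StandardCharsets.UTF_8).length);     // 9
        System.out.println("CJK   UTF-16: " + cjk.getBytes(StandardCharsets.UTF_16BE).length);  // 6
        System.out.println("ASCII UTF-8:  " + ascii.getBytes(StandardCharsets.UTF_8).length);   // 3
        System.out.println("ASCII UTF-16: " + ascii.getBytes(StandardCharsets.UTF_16BE).length);// 6
    }
}
```

So for text dominated by BMP CJK characters, UTF-16 is indeed about a third
smaller, which is the kind of workload where a non-default term attribute might
look attractive.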
> DelegatingAnalyzerWrapper should delegate normalization too
> -----------------------------------------------------------
>
> Key: LUCENE-7429
> URL: https://issues.apache.org/jira/browse/LUCENE-7429
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 6.2
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7355.patch, LUCENE-7429.patch, LUCENE-7429.patch
>
>
> This is something that I overlooked in LUCENE-7355:
> (Delegating)AnalyzerWrapper uses the default implementation of
> initReaderForNormalization and normalize, meaning that by default the
> normalization is a no-op. It should delegate to the wrapped analyzer.
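The shape of the bug described above can be sketched without Lucene. The classes
below are a hypothetical minimal analogue, not Lucene's real API: a base class
whose normalize() is a no-op by default, and a wrapper that silently inherits
that no-op instead of delegating to the analyzer it wraps.

```java
// Hypothetical analogue of (Delegating)AnalyzerWrapper's normalization bug.
abstract class SimpleAnalyzer {
    // Default implementation: normalization is a no-op.
    String normalize(String text) {
        return text;
    }
}

class LowercaseAnalyzer extends SimpleAnalyzer {
    @Override
    String normalize(String text) {
        return text.toLowerCase(java.util.Locale.ROOT);
    }
}

// Broken: inherits the no-op normalize(), so wrapping disables normalization.
class BrokenWrapper extends SimpleAnalyzer {
    final SimpleAnalyzer delegate;
    BrokenWrapper(SimpleAnalyzer delegate) { this.delegate = delegate; }
}

// Fixed: delegates normalization to the wrapped analyzer.
class DelegatingWrapper extends SimpleAnalyzer {
    final SimpleAnalyzer delegate;
    DelegatingWrapper(SimpleAnalyzer delegate) { this.delegate = delegate; }
    @Override
    String normalize(String text) {
        return delegate.normalize(text);
    }
}

public class NormalizationDemo {
    public static void main(String[] args) {
        SimpleAnalyzer wrapped = new LowercaseAnalyzer();
        System.out.println(new BrokenWrapper(wrapped).normalize("FOO"));     // FOO (bug: no-op)
        System.out.println(new DelegatingWrapper(wrapped).normalize("FOO")); // foo
    }
}
```

The proposed fix is the DelegatingWrapper pattern: override the normalization
entry points so they forward to the wrapped analyzer instead of falling back to
the base class's no-op.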
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)