[ https://issues.apache.org/jira/browse/LUCENE-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455008#comment-15455008 ]

Adrien Grand commented on LUCENE-7429:
--------------------------------------

bq. The issue here is mostly that we need to create a new TokenStream 
(StringTokenStream) for the normalization and we need to use the same attribute 
types.

Exactly. For instance, if a term attribute produced UTF-16 encoded tokens, the 
token stream that normalization wraps around the input string would need to 
expose that same attribute type for the token filters to consume it correctly.

bq. Although this is sometimes broken for use-cases where TokenStreams create 
binary tokens. But those would never be normalized, I think (!?)

Do you mean that you cannot think of any use-case that combines a non-default 
term attribute with token filters in the same analysis chain? I am wondering 
about CJK analyzers: I believe UTF-16 stores CJK characters a bit more 
efficiently on average than UTF-8 (I may be completely wrong, in which case 
please let me know), so users might be tempted to use a different term 
attribute impl?
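
To make the attribute-type constraint concrete, here is an illustrative sketch 
(not from the patch; the class names are the real Lucene 6.x ones, but the 
snippet is only a demonstration): a TokenFilter shares its input's 
AttributeSource, so every stage of a chain -- including any stream that 
normalization code creates internally -- must agree on the attribute 
implementation.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SharedAttributeSketch {
  public static void main(String[] args) {
    // A filter is constructed on top of its input and shares the same
    // AttributeSource, so addAttribute() on either end returns the one
    // shared CharTermAttribute instance.
    Tokenizer source = new KeywordTokenizer();
    TokenStream chain = new LowerCaseFilter(source);
    CharTermAttribute fromSource = source.addAttribute(CharTermAttribute.class);
    CharTermAttribute fromChain = chain.addAttribute(CharTermAttribute.class);
    System.out.println(fromSource == fromChain); // same instance

    // A normalization helper that wraps a plain String in its own stream
    // must therefore expose the same attribute types the filters expect;
    // a custom (e.g. binary or UTF-16) term attribute impl would not line up.
  }
}
```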

> DelegatingAnalyzerWrapper should delegate normalization too
> -----------------------------------------------------------
>
>                 Key: LUCENE-7429
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7429
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 6.2
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7355.patch, LUCENE-7429.patch, LUCENE-7429.patch
>
>
> This is something that I overlooked in LUCENE-7355: 
> (Delegating)AnalyzerWrapper uses the default implementation of 
> initReaderForNormalization and normalize, meaning that by default the 
> normalization is a no-op. It should delegate to the wrapped analyzer.
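
A minimal sketch of the reported behavior (illustrative, not taken from the 
attached patch): an analyzer that lowercases during normalization, wrapped in 
a DelegatingAnalyzerWrapper, silently loses that normalization because the 
wrapper inherits Analyzer's no-op defaults instead of delegating.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;

public class NormalizeDelegationSketch {
  public static void main(String[] args) {
    Analyzer delegate = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new KeywordTokenizer();
        return new TokenStreamComponents(source, new LowerCaseFilter(source));
      }
      @Override
      protected TokenStream normalize(String fieldName, TokenStream in) {
        // Lowercase at normalization (query) time as well as index time.
        return new LowerCaseFilter(in);
      }
    };

    Analyzer wrapper = new DelegatingAnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {
      @Override
      protected Analyzer getWrappedAnalyzer(String fieldName) {
        return delegate;
      }
    };

    // The delegate lowercases, but the unfixed wrapper falls back to
    // Analyzer's default (no-op) normalize()/initReaderForNormalization():
    System.out.println(delegate.normalize("f", "FOO").utf8ToString()); // "foo"
    System.out.println(wrapper.normalize("f", "FOO").utf8ToString());  // "FOO" before the fix
  }
}
```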



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
