[ 
https://issues.apache.org/jira/browse/LUCENE-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7355:
---------------------------------
    Attachment: LUCENE-7355.patch

bq.  it appears you accidentally included other WIP

Sorry, I probably generated the patch against the wrong base commit, hence these 
unrelated changes.

bq. Why create a StringTokenStream; isn't KeywordTokenizer fine? Oh I see 
that's in another module... kinda seems like a generic utility that should be 
in core to me IMO.

I'd be fine with having KeywordTokenizer in core too. Let's discuss it in another 
issue and potentially cut over to it if it makes it into core.

bq. An easy optimization is to check if initReaderForNormalization returns the 
input StringReader. If so, simply set filteredText to text.

The way #normalize works is indeed not very efficient at the moment. In 
addition to this, it does not cache its analysis chain like we do for 
#tokenStream. That is probably OK since this method should not be called as 
intensively as #tokenStream, and we can still improve it in the future if it 
proves to be a bottleneck.
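
For what it's worth, the identity check suggested above could look roughly like 
the sketch below. It is plain Java with no Lucene dependency: ReaderWrapper and 
filterText are hypothetical stand-ins for initReaderForNormalization and the 
char-filtering step of #normalize, not the code from the patch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class NormalizeFastPathSketch {

    /** Hypothetical stand-in for initReaderForNormalization: may wrap the reader or return it as-is. */
    interface ReaderWrapper {
        Reader wrap(Reader in);
    }

    /** Applies the (assumed) char filters to text, skipping the copy when nothing wrapped the reader. */
    static String filterText(String text, ReaderWrapper wrapper) {
        Reader input = new StringReader(text);
        Reader filtered = wrapper.wrap(input);
        if (filtered == input) {
            // Fast path: no char filter was added, so the text is unchanged.
            return text;
        }
        // Slow path: drain the filtered reader into a new string.
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[256];
        try {
            int read;
            while ((read = filtered.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The fast path only helps analyzers that add no char filters for normalization, 
but that is likely the common case.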

bq. It's a shame to call createComponents just to get the AttributeFactory

Agreed, this one annoys me too. I initially wanted to add a method, but that 
felt like a pity since this information is already available. That said, maybe 
the method approach is better since borrowing the attribute factory from the 
regular analysis chain makes us close the token stream before it has been 
consumed, which some analysis chains might not like. I updated the patch.

bq. I suppose a separate issue might be for Solr to do this when someone 
configures a custom Analyzer.

Solr already solves this problem in a different way by having a different 
analyzer for multi-term queries which is computed using 
MultiTermAwareComponent. I agree it would be nice for it to switch to 
Analyzer#normalize, but this has a drawback: either it would require dropping 
support for configuring a custom multi-term analyzer, or the integration would 
be a bit weird, i.e. it would have to use Analyzer.tokenStream on the 
multi-term analyzer if one is configured, or fall back to Analyzer.normalize on 
the default analyzer if no multi-term analyzer is configured, which might be 
controversial.
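
To make the second option concrete, here is a rough sketch of the fallback 
described above. SimpleAnalyzer, analyzeSingleToken and normalizeMultiTerm are 
made-up stand-ins, not Solr or Lucene APIs; the only point is the order of the 
two checks.

```java
class MultiTermFallbackSketch {

    /** Hypothetical stand-in for an analyzer; this is not the Lucene Analyzer class. */
    interface SimpleAnalyzer {
        /** Stand-in for running Analyzer.tokenStream and taking the single resulting token. */
        String analyzeSingleToken(String field, String text);

        /** Stand-in for Analyzer.normalize. */
        String normalize(String field, String text);
    }

    /**
     * Uses the explicitly configured multi-term analyzer when there is one,
     * otherwise falls back to normalization on the default analyzer.
     */
    static String normalizeMultiTerm(String field, String text,
                                     SimpleAnalyzer defaultAnalyzer,
                                     SimpleAnalyzer multiTermAnalyzer) {
        if (multiTermAnalyzer != null) {
            // A custom multi-term analyzer was configured: run its full chain.
            return multiTermAnalyzer.analyzeSingleToken(field, text);
        }
        // Nothing configured: rely on the default analyzer's normalization.
        return defaultAnalyzer.normalize(field, text);
    }
}
```

Having two different code paths depending on configuration is the part that 
might be controversial.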

> Leverage MultiTermAwareComponent in query parsers
> -------------------------------------------------
>
>                 Key: LUCENE-7355
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7355
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7355.patch, LUCENE-7355.patch, LUCENE-7355.patch, 
> LUCENE-7355.patch, LUCENE-7355.patch, LUCENE-7355.patch
>
>
> MultiTermAwareComponent is designed to make it possible to do the right thing 
> in query parsers when it comes to analysis of multi-term queries. However, 
> since query parsers just take an analyzer and since analyzers do not 
> propagate the information about what to do for multi-term analysis, query 
> parsers cannot do the right thing out of the box.


