[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385297#comment-16385297 ]

Robert Muir commented on LUCENE-8186:
-------------------------------------

Uwe: I agree with you. For "normalize", the tokenization is implicitly keyword. 
Tokenizers can't concern themselves with the syntax of wildcards or regexes, 
sorry. They work on real text.

I also agree that LowerCaseTokenizer is stupid :) But its factory does the 
correct thing and returns a LowerCase*Filter* for these purposes!

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java#L70-L74

So I think CustomAnalyzer forgets to call that method (getMultiTermComponent) on 
the TokenizerFactory, just in case the tokenizer is doing something like this. It 
seems like this is really easy to fix in CustomAnalyzer.normalize? And yeah, 
separately, we should deprecate LowerCaseTokenizer.
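
As a rough illustration (this is only a sketch in Lucene 7.x terms, not the 
committed patch; the field names tokenizer/tokenFilters are assumed to match 
CustomAnalyzer's internals), the fix would consult the TokenizerFactory the same 
way the token filters are already consulted:

{noformat}
// Sketch only: also ask the TokenizerFactory for its multi-term component,
// exactly as CustomAnalyzer.normalize already does for each TokenFilterFactory.
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
  TokenStream result = in;
  // If the tokenizer itself filters (e.g. LowerCaseTokenizer), its factory
  // advertises an equivalent filter via getMultiTermComponent().
  if (tokenizer instanceof MultiTermAwareComponent) {
    AbstractAnalysisFactory f =
        ((MultiTermAwareComponent) tokenizer).getMultiTermComponent();
    if (f instanceof TokenFilterFactory) {
      result = ((TokenFilterFactory) f).create(result);
    }
  }
  // Existing behavior: apply the multi-term-aware token filters.
  for (TokenFilterFactory filter : tokenFilters) {
    if (filter instanceof MultiTermAwareComponent) {
      AbstractAnalysisFactory mf =
          ((MultiTermAwareComponent) filter).getMultiTermComponent();
      result = ((TokenFilterFactory) mf).create(result);
    }
  }
  return result;
}
{noformat}

With something like that in place, LowerCaseTokenizerFactory's 
getMultiTermComponent (the linked L70-74) would kick in for normalize() and for 
multiterm queries, without tokenizers ever having to see wildcard syntax.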

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-8186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8186
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> While working on SOLR-12034, I found that a unit test relying on 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
>     Analyzer analyzer = CustomAnalyzer.builder()
>         .withTokenizer(LowerCaseTokenizerFactory.class)
>         .build();
>     //fails
>     assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>     
>     //now try an integration test with the classic query parser
>     QueryParser p = new QueryParser("f", analyzer);
>     Query q = p.parse("Hello");
>     //passes
>     assertEquals(new TermQuery(new Term("f", "hello")), q);
>     q = p.parse("Hello*");
>     //fails
>     assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>     q = p.parse("Hel*o");
>     //fails
>     assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that CustomAnalyzer.normalize iterates through the token 
> filters but never consults the tokenizer, which, in the case of 
> LowerCaseTokenizer, is what does the filtering work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
