[
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030223#comment-16030223
]
Erick Erickson commented on LUCENE-7705:
----------------------------------------
OK, I see what's happening. As I noted earlier, the way this has always been
implemented, multiple tokens are emitted when the max token length is exceeded.
In this case, the value sent in the doc is "letter", so two tokens are emitted:
"let" and "ter", with a position increment between them I think.
The search is against "lett". For some reason, the parsed query in 6.x is:
PhraseQuery(letter0:"let t")
while in master it's:
letter0:let letter0:t
Even this is wrong; it just happens to succeed because the default operator is
OR. The tokens in the index do not include a bare "t", so only the "let" clause
matches and the doc is found by chance, not by design.
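You can see the difference by parsing the same string against the limited
analyzer. A sketch, again assuming the hypothetical 3-char limit; the two
printed forms are what the branches above produce:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.AttributeFactory;

public class ParseDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(
            new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3));
      }
    };
    // "lett" analyzes to two tokens, "let" and "t".
    Query q = new QueryParser("letter0", analyzer).parse("lett");
    // 6.x prints:    letter0:"let t"        (a PhraseQuery)
    // master prints: letter0:let letter0:t  (two SHOULD clauses, i.e. OR)
    System.out.println(q);
  }
}
{code}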
I think the right solution is to stop emitting tokens for a particular value
once maxTokenLen is exceeded. I'll raise a new JIRA and we can debate it there.
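To make the proposal concrete, a test-style sketch; assertTokenStreamContents
is from Lucene's BaseTokenStreamTestCase, and the commented-out expectation is
the proposed behavior, not what the code does today:

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class TestMaxTokenLenOverflow extends BaseTokenStreamTestCase {
  public void testOverflow() throws Exception {
    Tokenizer tok =
        new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3);
    tok.setReader(new StringReader("letter"));
    // Long-standing behavior: the overflow spills into a second token.
    assertTokenStreamContents(tok, new String[] {"let", "ter"});
    // Proposed behavior would emit only the first chunk:
    // assertTokenStreamContents(tok, new String[] {"let"});
  }
}
{code}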
This is _not_ a change in behavior resulting from the changes in this JIRA;
the tests just expose something that has always been the case but nobody has
noticed.
> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Amrit Sarkar
> Assignee: Erick Erickson
> Priority: Minor
> Fix For: master (7.0), 6.7
>
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character
> limit for the CharTokenizer? Changing this limit requires people to copy/paste
> incrementToken into some new class, since incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> doing so requires code rather than being configurable in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and
> their factories, it would take adding a c'tor to the base class in Lucene and
> using it in each factory.
> Any objections?
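For reference, once the factories accept the limit it becomes a plain init arg;
a hypothetical sketch (the maxTokenLen parameter name is an assumption based on
the attached patches):

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizerFactory;

public class FactoryDemo {
  public static void main(String[] args) {
    // The same args a Solr schema would pass, e.g.
    // <tokenizer class="solr.LetterTokenizerFactory" maxTokenLen="3"/>
    Map<String, String> initArgs = new HashMap<>();
    initArgs.put("maxTokenLen", "3");
    Tokenizer tok = new LetterTokenizerFactory(initArgs).create();
    System.out.println("created " + tok.getClass().getSimpleName());
  }
}
{code}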