[jira] [Commented] (LUCENE-7857) CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded

Robert Muir (JIRA) Tue, 30 May 2017 16:41:17 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030373#comment-16030373
 ]


Robert Muir commented on LUCENE-7857:
-------------------------------------

My opinion: behavior should be consistent with StandardTokenizer & co.

I don't think we should make heroic efforts to do great things with too-long 
tokens. If someone wants a maxTokenLen of 3 or something, then I think it's 
better to look at n-grams for that case.
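
To illustrate the n-gram suggestion with the "letter"/"lett" example from the issue description, here is a minimal sketch in plain Python. It is not Lucene code: char_ngrams is a hypothetical helper, and a real setup would use one of Lucene's n-gram tokenizers or filters on both the index and query side.

```python
def char_ngrams(text, n):
    """Return every overlapping substring of length n (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Trigrams that would be indexed for the document term.
indexed = set(char_ngrams("letter", 3))   # let, ett, tte, ter

# Trigrams produced for two queries.
lett_grams = char_ngrams("lett", 3)       # let, ett
bett_grams = char_ngrams("bett", 3)       # bet, ett

# With AND semantics every query gram must be present in the document.
lett_matches = all(g in indexed for g in lett_grams)
bett_matches = all(g in indexed for g in bett_grams)
```

Because the grams overlap, every gram of "lett" also occurs in "letter", so an AND query can succeed, while "bett" fails on "bet". Fixed, non-overlapping chunks give no such guarantee.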

> CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens 
> when the max length is exceeded
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7857
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7857
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> Assigning to myself to not lose track of it.
> LUCENE-7705 introduced the ability to define the allowable token length for 
> these tokenizers rather than hard-coding it to 255. It's always been the case 
> that when the hard-coded limit was exceeded, multiple tokens would be 
> emitted. However, the tests for LUCENE-7705 exposed a problem.
> Suppose the max length is 3 and the doc contains "letter". Two tokens are 
> emitted and indexed: "let" and "ter".
> Now suppose the search is for "lett". If the default operator is AND or 
> phrase queries are constructed, the query fails, since the tokens emitted are 
> "let" and "t". Only if the operator is OR is the document found, and even 
> then the result isn't correct, since searching for "lett" would also match a 
> document indexed with "bett" because it would match on the bare "t".
> Proposal: 
> The remainder of the token should be ignored when maxTokenLen is exceeded.
> [~rcmuir][~steve_rowe][~tomasflobbe] comments? Again, this behavior was not 
> introduced by LUCENE-7705, it's just that it would be very hard to notice 
> with the default 255 char limit.
> I'm not quite sure why master generates a parsed query of:
> field:let field:t
> and 6x generates
> field:"let t"
> so the tests succeeded on master but not on 6x....
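
The scenario in the issue description can be reproduced with a small plain-Python simulation. This is only a sketch under stated assumptions: it models a single over-long word, the helpers split_tokens, truncate_token, matches_and and matches_or are hypothetical, and none of it is the Lucene implementation.

```python
def split_tokens(word, max_len):
    """Behavior described in the issue: emit consecutive chunks of max_len chars."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def truncate_token(word, max_len):
    """Proposed behavior: keep the first max_len chars and ignore the remainder."""
    return [word[:max_len]] if word else []


def matches_and(indexed, query_tokens):
    """AND semantics: every query token must be in the index."""
    return all(t in indexed for t in query_tokens)


def matches_or(indexed, query_tokens):
    """OR semantics: any query token in the index is enough."""
    return any(t in indexed for t in query_tokens)


MAX_LEN = 3

# Splitting: "letter" is indexed as let/ter, the query "lett" becomes let/t.
letter_split = set(split_tokens("letter", MAX_LEN))
bett_split = set(split_tokens("bett", MAX_LEN))
lett_query_split = split_tokens("lett", MAX_LEN)

and_finds_letter = matches_and(letter_split, lett_query_split)   # fails on "t"
or_finds_letter = matches_or(letter_split, lett_query_split)     # hits on "let"
or_finds_bett = matches_or(bett_split, lett_query_split)         # false hit on "t"

# Truncation: both "letter" and the query "lett" reduce to the single token "let".
letter_trunc = set(truncate_token("letter", MAX_LEN))
bett_trunc = set(truncate_token("bett", MAX_LEN))
lett_query_trunc = truncate_token("lett", MAX_LEN)

trunc_finds_letter = matches_and(letter_trunc, lett_query_trunc)
trunc_finds_bett = matches_or(bett_trunc, lett_query_trunc)
```

Under truncation the query finds "letter" with either operator and no longer matches "bett". The trade-off is that any two terms sharing the same first maxTokenLen characters become indistinguishable.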



