[
https://issues.apache.org/jira/browse/LUCENE-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031588#comment-16031588
]
Erick Erickson commented on LUCENE-7857:
----------------------------------------
OK, then to paraphrase:
The behavior is correct, fix the tests ;)
> CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens
> when the max length is exceeded
> ----------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7857
> URL: https://issues.apache.org/jira/browse/LUCENE-7857
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
>
> Assigning to myself to not lose track of it.
> LUCENE-7705 introduced the ability to define the allowable token length for
> these tokenizers rather than hard-coding it to 255. It's always been the case
> that when the hard-coded limit was exceeded, multiple tokens would be
> emitted. However, the tests for LUCENE-7705 exposed a problem.
> Suppose the max length is 3 and the doc contains "letter". Two tokens are
> emitted and indexed: "let" and "ter".
> Now suppose the search is for "lett". If the default operator is AND or
> phrase queries are constructed, the query fails because the query tokens
> are "let" and "t", and "t" was never indexed for that document. Only if the
> operator is OR is the document found, and even then the result isn't
> correct, since searching for "lett" would also match a document indexed
> with "bett" because it would match on the bare "t".
> Proposal:
> The remainder of the token should be ignored when maxTokenLen is exceeded.
> [~rcmuir][~steve_rowe][~tomasflobbe] comments? Again, this behavior was not
> introduced by LUCENE-7705, it's just that it would be very hard to notice
> with the default 255 char limit.
> I'm not quite sure why master generates a parsed query of:
> field:let field:t
> and 6x generates
> field:"let t"
> so the tests succeeded on master but not on 6x....
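
To make the quoted example concrete, here is a minimal sketch in plain Python. It is a toy model, not Lucene code: the function name `char_tokenize` and the `drop_remainder` flag are hypothetical, and it assumes a letter-only tokenizer. With `drop_remainder=False` it models the splitting described in the issue; with `drop_remainder=True` it models the proposal to ignore the rest of an over-long token.

```python
def char_tokenize(text, max_token_len, drop_remainder=False):
    """Toy model of a letter-based tokenizer with a maximum token length.

    Hypothetical helper for illustration only; not Lucene's implementation.
    """
    tokens = []
    current = []
    skipping = False  # True while discarding the tail of an over-long token
    for ch in text:
        if ch.isalpha():
            if skipping:
                continue
            current.append(ch)
            if len(current) == max_token_len:
                # Limit reached: emit what we have so far.
                tokens.append("".join(current))
                current = []
                # Either keep splitting (described behavior) or drop the tail.
                skipping = drop_remainder
        else:
            # A non-letter ends the current token and any skipping.
            if current:
                tokens.append("".join(current))
                current = []
            skipping = False
    if current:
        tokens.append("".join(current))
    return tokens


def matches_and(query_tokens, doc_tokens):
    """AND semantics: every query token must appear in the document."""
    return set(query_tokens) <= set(doc_tokens)


def matches_or(query_tokens, doc_tokens):
    """OR semantics: at least one query token must appear in the document."""
    return bool(set(query_tokens) & set(doc_tokens))


# Behavior described in the issue: over-long tokens are split.
doc = char_tokenize("letter", 3)      # ["let", "ter"]
query = char_tokenize("lett", 3)      # ["let", "t"]
other_doc = char_tokenize("bett", 3)  # ["bet", "t"]

print(matches_and(query, doc))        # False: "t" was never indexed for "letter"
print(matches_or(query, other_doc))   # True: spurious match on the bare "t"

# Proposed behavior: the remainder is ignored.
print(char_tokenize("letter", 3, drop_remainder=True))  # ["let"]
```

In this model, dropping the remainder makes "lett" and "letter" both reduce to the single token "let", so the AND and OR cases agree; whether that trade-off is desirable is exactly what the issue is asking about.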