[ https://issues.apache.org/jira/browse/LUCENE-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030459#comment-16030459 ]
Steve Rowe commented on LUCENE-7857:
------------------------------------
I agree with Robert.
See my answer to a question about why StandardTokenizer effectively splits
tokens that are longer than maxTokenLength in this recent java-user mailing
list thread:
[https://lists.apache.org/thread.html/42af955be9522cff0d28b47d7fa723d90846ad011157503fcf687f99@%3Cjava-user.lucene.apache.org%3E].
The workaround I outlined on that thread would work here too: set
maxTokenLength super-high, then use LengthFilter to remove tokens longer than
what you want to keep.
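
To make the workaround concrete, here is a minimal plain-Java model of it. This is not Lucene code: the class and method names (LengthFilterWorkaroundSketch, tokenize, dropLongTokens) are hypothetical, tokenization is simplified to whitespace splitting, and the chunking only imitates the splitting behavior described in this issue. In a real analyzer the two steps would be the tokenizer's maxTokenLength setting and a LengthFilter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer plus length filter; not the Lucene API.
final class LengthFilterWorkaroundSketch {

    // Models the splitting behavior: each whitespace-separated word is cut
    // into consecutive chunks of at most maxTokenLength characters.
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields no tokens
            }
            for (int start = 0; start < word.length(); start += maxTokenLength) {
                int end = Math.min(word.length(), start + maxTokenLength);
                tokens.add(word.substring(start, end));
            }
        }
        return tokens;
    }

    // Models the filtering step: drop every token longer than maxKept.
    static List<String> dropLongTokens(List<String> tokens, int maxKept) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxKept) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A low limit splits the long word into fragments.
        System.out.println(tokenize("a letter", 3));
        // A very high limit keeps the word whole; the filter then removes it.
        System.out.println(dropLongTokens(tokenize("a letter", 1_000_000), 3));
    }
}
```

With the limit set very high, the long word survives tokenization intact and is then dropped as a whole, so no partial fragments reach the index.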
> CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens
> when the max length is exceeded
> ----------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7857
> URL: https://issues.apache.org/jira/browse/LUCENE-7857
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
>
> Assigning to myself to not lose track of it.
> LUCENE-7705 introduced the ability to configure the allowable token length
> for these tokenizers rather than hard-coding it to 255. It has always been
> the case that when the hard-coded limit was exceeded, multiple tokens were
> emitted. However, the tests for LUCENE-7705 exposed a problem.
> Suppose the max length is 3 and the doc contains "letter". Two tokens are
> emitted and indexed: "let" and "ter".
> Now suppose the search is for "lett". If the default operator is AND, or if
> phrase queries are constructed, the query fails because the query tokens are
> "let" and "t", and "t" was never indexed for this document. Only if the
> operator is OR is the document found, and even then the result is not
> correct: searching for "lett" would also match a document indexed with
> "bett", because it would match on the bare "t".
> Proposal:
> The remainder of the token should be ignored when maxTokenLen is exceeded.
> [~rcmuir][~steve_rowe][~tomasflobbe] comments? Again, this behavior was not
> introduced by LUCENE-7705, it's just that it would be very hard to notice
> with the default 255 char limit.
> I'm not quite sure why master generates a parsed query of:
> field:let field:t
> and 6x generates
> field:"let t"
> so the tests succeeded on master but not on 6x....
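
The example in the description above can be replayed with a small plain-Java model. This is only a sketch and not Lucene code: the names (TokenSplitIssueSketch, split, truncate, matchesAll, matchesAny) are hypothetical, and AND/OR matching is reduced to set membership on the emitted tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the behavior discussed in this issue; not the Lucene API.
final class TokenSplitIssueSketch {

    // Behavior as described: an over-long token becomes consecutive chunks.
    static List<String> split(String token, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < token.length(); start += maxLen) {
            out.add(token.substring(start, Math.min(token.length(), start + maxLen)));
        }
        return out;
    }

    // Proposed behavior: keep the first maxLen characters, ignore the rest.
    static List<String> truncate(String token, int maxLen) {
        return Collections.singletonList(
                token.substring(0, Math.min(token.length(), maxLen)));
    }

    // AND semantics: every query token must occur in the document.
    static boolean matchesAll(List<String> docTokens, List<String> queryTokens) {
        return new HashSet<>(docTokens).containsAll(queryTokens);
    }

    // OR semantics: at least one query token must occur in the document.
    static boolean matchesAny(List<String> docTokens, List<String> queryTokens) {
        Set<String> doc = new HashSet<>(docTokens);
        for (String q : queryTokens) {
            if (doc.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(split("letter", 3)); // [let, ter]
        System.out.println(split("lett", 3));   // [let, t]
        // AND: "t" is not among the indexed tokens of "letter".
        System.out.println(matchesAll(split("letter", 3), split("lett", 3))); // false
        // OR: "lett" wrongly matches "bett" through the bare "t".
        System.out.println(matchesAny(split("bett", 3), split("lett", 3)));   // true
        // Proposal: with truncation, "lett" finds "letter" but not "bett".
        System.out.println(matchesAll(truncate("letter", 3), truncate("lett", 3))); // true
        System.out.println(matchesAny(truncate("bett", 3), truncate("lett", 3)));   // false
    }
}
```

Under this model, ignoring the remainder makes the AND query succeed on "letter" and removes the false match on "bett".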