[jira] [Commented] (LUCENE-7857) CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded

Steve Rowe (JIRA) Tue, 30 May 2017 18:08:33 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030459#comment-16030459
 ]


Steve Rowe commented on LUCENE-7857:
------------------------------------

I agree with Robert.

See my answer to a question about why StandardTokenizer effectively splits 
tokens that are longer than maxTokenLength in this recent java-user mailing 
list thread: 
[https://lists.apache.org/thread.html/42af955be9522cff0d28b47d7fa723d90846ad011157503fcf687f99@%3Cjava-user.lucene.apache.org%3E].

The workaround I outlined on that thread would work here too: set 
maxTokenLength super-high, then use LengthFilter to remove tokens longer than 
what you want to keep.
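
To make the two steps of that workaround concrete, here is a minimal plain-Java 
sketch that only simulates them and does not use the Lucene API. The class and 
method names (LengthWorkaroundSketch, splitOnWhitespace, keepUpTo) are 
hypothetical stand-ins: in a real analysis chain, the tokenizer with a very large 
maxTokenLength and LengthFilter would play these roles.

```java
import java.util.ArrayList;
import java.util.List;

public class LengthWorkaroundSketch {

    // Stand-in for a tokenizer whose max token length is so large that
    // no token is ever split: every whitespace-separated word stays whole.
    public static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for a length filter: drops whole tokens longer than maxLen
    // instead of cutting them into pieces.
    public static List<String> keepUpTo(List<String> tokens, int maxLen) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() <= maxLen) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = splitOnWhitespace("a big letter");
        // "letter" is removed entirely, so no "let"/"ter" fragments are indexed.
        System.out.println(keepUpTo(tokens, 3)); // prints [a, big]
    }
}
```

The point of the design is that an over-long token is dropped as a unit, so no 
partial fragments ever reach the index.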

> CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens 
> when the max length is exceeded
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7857
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7857
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> Assigning to myself to not lose track of it.
> LUCENE-7705 introduced the ability to define the allowable token length for 
> these tokenizers rather than hard-coding it to 255. It's always been the case 
> that when the hard-coded limit was exceeded, multiple tokens would be 
> emitted. However, the tests for LUCENE-7705 exposed a problem.
> Suppose the max length is 3 and the doc contains "letter". Two tokens are 
> emitted and indexed: "let" and "ter".
> Now suppose the search is for "lett". If the default operator is AND, or if 
> phrase queries are constructed, the query fails, since the tokens emitted are 
> "let" and "t". Only if the operator is OR is the document found, and even 
> then the result won't be correct, since searching for "lett" would also match 
> a document indexed with "bett": it would match on the bare "t".
> Proposal: 
> The remainder of the token should be ignored when maxTokenLen is exceeded.
> [~rcmuir][~steve_rowe][~tomasflobbe] comments? Again, this behavior was not 
> introduced by LUCENE-7705, it's just that it would be very hard to notice 
> with the default 255 char limit.
> I'm not quite sure why master generates a parsed query of:
> field:let field:t
> and 6x generates
> field:"let t"
> so the tests succeeded on master but not on 6x....
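
The splitting behavior described in the quoted issue can be reproduced with a 
small plain-Java simulation. This is only a sketch of the reported behavior, not 
Lucene code: the class SplitTokenSketch and its methods chunk, matchesAll, and 
matchesAny are hypothetical, and maxLen is assumed to be greater than zero.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitTokenSketch {

    // Simulates the reported behavior: a token longer than maxLen is cut
    // into consecutive chunks of at most maxLen characters (maxLen > 0).
    public static List<String> chunk(String token, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < token.length(); i += maxLen) {
            out.add(token.substring(i, Math.min(token.length(), i + maxLen)));
        }
        return out;
    }

    // AND semantics: every chunk of the query must be among the indexed chunks.
    public static boolean matchesAll(String indexed, String query, int maxLen) {
        Set<String> indexedTokens = new HashSet<>(chunk(indexed, maxLen));
        return indexedTokens.containsAll(chunk(query, maxLen));
    }

    // OR semantics: a single shared chunk is enough to match.
    public static boolean matchesAny(String indexed, String query, int maxLen) {
        Set<String> indexedTokens = new HashSet<>(chunk(indexed, maxLen));
        for (String q : chunk(query, maxLen)) {
            if (indexedTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(chunk("letter", 3));                // [let, ter]
        System.out.println(chunk("lett", 3));                  // [let, t]
        System.out.println(matchesAll("letter", "lett", 3));   // false: "t" is not indexed
        System.out.println(matchesAny("bett", "lett", 3));     // true: matches on the bare "t"
    }
}
```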



