[
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005888#comment-16005888
]
Erick Erickson commented on LUCENE-7705:
----------------------------------------
One nit, one suggestion, and one question in addition to Robert's comments:
The nit:
There's a pattern of a bunch of these:
updateJ("{\"add\":{\"doc\":{\"id\":1,\"letter\":\"letter\"}},\"commit\":{}}", null);
...
then:
assertU(commit());
It's unnecessary to do the commit with each updateJ call; the commit at the end
takes care of it all. Committing with each doc is a little less efficient.
Frankly, I doubt that'd be measurable performance-wise, but let's take them out
anyway.
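In other words, something like this (same doc as above, just without the
per-document commit; the exact documents don't matter):
updateJ("{\"add\":{\"doc\":{\"id\":1,\"letter\":\"letter\"}}}", null);
// ... more adds, none of them committing ...
assertU(commit()); // one commit at the end covers all the adds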
The suggestion:
When we do stop accruing characters, e.g. in CharTokenizer, let's log an
INFO-level message to that effect, something like:
log.info("Splitting token at {} chars", maxTokenLen);
That way people will have a clue where to look. I think INFO is appropriate
rather than WARN or ERROR, since it's perfectly legitimate to truncate input;
I'm thinking of OCR'd text, for instance. Maybe also dump the token we've
accumulated so far?
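Rough sketch of where that could live in incrementToken(), just to illustrate
(variable names from memory, not the actual source):
if (length >= maxTokenLen) {
  // we hit the configured limit; tell the user rather than silently splitting
  log.info("Splitting token at {} chars", maxTokenLen);
  // and/or dump what we've accumulated so far:
  // log.info("Token so far: {}", termAtt.toString());
  break;
}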
I worded it as "splitting" because (and there are tests for this) the next
token picks up where the first left off. So if the max length is 3 and the
input is "whitespace", we get several tokens as a result:
"whi", "tes", "pac", and "e".
I suppose that means the offsets are also incremented. Is that really what we
want here? Or should we instead throw away the extra tokens? [~rcmuir], what
do you think is correct? This is not a _change_ in behavior; the current code
does the same thing, just with a hard-coded 255 limit. I'm just checking
whether this is the intended behavior.
If we do want to throw away the extra, we could spin through the buffer until
we encounter a non-token character and then return only the first maxTokenLen
characters. If we did that, we could also log the entire token and the
truncated version if we wanted.
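Something along these lines (pseudo-code, not the actual CharTokenizer
internals; isTokenChar() is real, the rest is hand-waving):
if (length >= maxTokenLen) {
  // skip the remainder of the over-long token so it doesn't produce extra tokens
  while (hasMoreInput() && isTokenChar(peekNextChar())) {
    consumeNextChar(); // discarded
  }
  // emit only the first maxTokenLen characters
}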
Other than that, precommit passes and all tests pass.
> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Amrit Sarkar
> Assignee: Erick Erickson
> Priority: Minor
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256-character
> limit for the CharTokenizer? In order to change this limit, people have to
> copy/paste the incrementToken method into some new class, since
> incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and
> their factories, it would take adding a c'tor to the base class in Lucene
> and using it in the factories.
> Any objections?