[
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005888#comment-16005888
]
Erick Erickson commented on LUCENE-7705:
----------------------------------------
One nit, one suggestion, and one question in addition to Robert's comments:
The nit:
There's a pattern of a bunch of these:
updateJ("{\"add\":{\"doc\":{\"id\":1,\"letter\":\"letter\"}},\"commit\":{}}", null);
...
then:
assertU(commit());
It's unnecessary to do the commit with each updateJ call; the commit at the end
takes care of it all. Committing with each doc is a little less efficient.
Frankly, I doubt that'd be measurable performance-wise, but let's take them out
anyway.
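In other words, something like this (same doc as above, just without the
per-document commit; the exact documents don't matter):
updateJ("{\"add\":{\"doc\":{\"id\":1,\"letter\":\"letter\"}}}", null);
// ... more adds, none of them committing ...
assertU(commit()); // one commit at the end covers all the adds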
The suggestion:
When we do stop accruing characters, e.g. in CharTokenizer, let's log an
INFO-level message to that effect, something like:
log.info("Splitting token at {} chars", maxTokenLen);
That way people will have a clue where to look. I think INFO is appropriate
rather than WARN or ERROR, since it's perfectly legitimate to truncate input;
I'm thinking of OCR'd text, for instance. Maybe also dump the token we've
accumulated so far?
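Rough sketch of where that could live in incrementToken(), just to illustrate
(variable names from memory, not the actual source):
if (length >= maxTokenLen) {
  // we hit the configured limit; tell the user rather than silently splitting
  log.info("Splitting token at {} chars", maxTokenLen);
  // and/or dump what we've accumulated so far:
  // log.info("Token so far: {}", termAtt.toString());
  break;
}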
I worded it as "splitting" because (and there are tests for this) the next
token picks up where the first left off. So if the max length is 3 and the
input is "whitespace", we get several tokens as a result:
"whi", "tes", "pac", and "e".
I suppose that means the offsets are also incremented. Is that really what we
want here? Or should we instead throw away the extra tokens? [~rcmuir], what
do you think is correct? This is not a _change_ in behavior; the current code
does the same thing, just with a hard-coded 255 limit. I'm just checking
whether this is the intended behavior.
If we do want to throw away the extra, we could spin through the buffer until
we encounter a non-token character and then return only the first maxTokenLen
characters. If we did that, we could also log the entire token and the
truncated version if we wanted.
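Something along these lines (pseudo-code, not the actual CharTokenizer
internals; isTokenChar() is real, the rest is hand-waving):
if (length >= maxTokenLen) {
  // skip the remainder of the over-long token so it doesn't produce extra tokens
  while (hasMoreInput() && isTokenChar(peekNextChar())) {
    consumeNextChar(); // discarded
  }
  // emit only the first maxTokenLen characters
}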
Other than that, precommit passes and all tests pass.
> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Amrit Sarkar
> Assignee: Erick Erickson
> Priority: Minor
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256-character
> limit for the CharTokenizer? In order to change this limit, people have to
> copy/paste the incrementToken method into some new class, since
> incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and
> their factories, it would take adding a c'tor to the base class in Lucene
> and using it in the factories.
> Any objections?