[
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005739#comment-16005739
]
Robert Muir commented on LUCENE-7705:
-------------------------------------
Why do the tokenizer ctor args take {{Integer}} instead of {{int}}? They check
for null and throw an exception in that case, but I think it's better to encode
the correct argument type in the method signature. This would also mean you can
re-enable TestRandomChains for the newly-added ctors.
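For illustration, a minimal sketch of the difference (a hypothetical
placeholder class, not the actual patch code):
{code}
// Hypothetical sketch only; placeholder names, not the patch code.
public final class MaxTokenLenCtors {
  private final int maxTokenLen;

  // Boxed parameter: null is representable, so it must be rejected at
  // runtime, and random test frameworks can't easily generate args for it.
  public MaxTokenLenCtors(Integer maxTokenLen) {
    if (maxTokenLen == null) {
      throw new IllegalArgumentException("maxTokenLen must not be null");
    }
    this.maxTokenLen = maxTokenLen;
  }

  // Primitive parameter: the signature itself rules out null.
  public MaxTokenLenCtors(int maxTokenLen) {
    this.maxTokenLen = maxTokenLen;
  }
}
{code}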
I think that if basic tokenizers like WhitespaceTokenizer have new ctors that
take {{int}} and throw an exception on illegal values for this parameter
(since some values aren't permitted), then these new ctors really could use
better docs than:
{quote}
@param maxTokenLen max token length the tokenizer will emit
{quote}
Instead, maybe something like:
{quote}
@param maxTokenLen maximum token length the tokenizer will emit. Must not be
negative.
@throws IllegalArgumentException if maxTokenLen is invalid.
{quote}
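In context, on a hypothetical {{int}} ctor (placeholder class, not the actual
patch), that reads:
{code}
public final class BoundedTokenizerDoc {
  private final int maxTokenLen;

  /**
   * @param maxTokenLen maximum token length the tokenizer will emit.
   *        Must not be negative.
   * @throws IllegalArgumentException if maxTokenLen is invalid.
   */
  public BoundedTokenizerDoc(int maxTokenLen) {
    if (maxTokenLen < 0) {
      throw new IllegalArgumentException(
          "maxTokenLen must not be negative, got " + maxTokenLen);
    }
    this.maxTokenLen = maxTokenLen;
  }
}
{code}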
I am also concerned about the allowable value ranges and the disabling of
tests, as far as correctness goes. If {{int}} is used instead of {{Integer}},
the test can be re-enabled, which may find/prevent crazy bugs. For example, I
worry a bit about what each of these Tokenizers will do if maxTokenLen is 0
(which is currently allowed): will they loop forever, crash, etc.?
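As a concrete probe (this assumes the patch's new
{{WhitespaceTokenizer(int)}} ctor; the loop bound is only a safety net, not a
claim about what actually happens):
{code}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MaxTokenLenZeroProbe {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new WhitespaceTokenizer(0); // assumes the new int ctor
    tok.setReader(new StringReader("abc def"));
    tok.reset();
    int emitted = 0;
    while (tok.incrementToken()) {
      // Bound the loop so a tokenizer that never advances fails fast
      // instead of hanging.
      if (++emitted > 1000) {
        throw new AssertionError(
            "still emitting after 1000 tokens with maxTokenLen=0");
      }
    }
    tok.end();
    tok.close();
  }
}
{code}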
The maximum value of the range is also suspicious. What happens with any
{{CharTokenizer}}-based tokenizer if I supply a value > IO_BUFFER_SIZE? Maybe
it behaves correctly, maybe not. I think allowing {{Integer.MAX_VALUE}} is a
trap for a number of reasons: enormous values will likely just blow up anyway
(terms are limited to 32766 bytes in the index); if they don't, they may
create hellacious threadlocal memory usage in buffers; bad/malicious input
could take out JVMs; etc.
Maybe follow the example of StandardTokenizer here:
{code}
/** Absolute maximum sized token */
public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
...
* @throws IllegalArgumentException if the given length is outside of the
* range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}].
...
if (length < 1) {
  throw new IllegalArgumentException("maxTokenLength must be greater than zero");
} else if (length > MAX_TOKEN_LENGTH_LIMIT) {
  throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
}
{code}
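For the new ctors, that could boil down to a shared guard along these lines (a
sketch only; the constant is borrowed from StandardTokenizer for illustration,
and where such a helper should live is an open question):
{code}
public final class TokenLengthGuard {
  /** Absolute maximum sized token, mirroring StandardTokenizer's limit. */
  public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

  private TokenLengthGuard() {}

  /**
   * @return the validated length, so ctors can use it inline
   * @throws IllegalArgumentException if length is outside
   *         [1, {@value #MAX_TOKEN_LENGTH_LIMIT}]
   */
  public static int checkMaxTokenLength(int length) {
    if (length < 1) {
      throw new IllegalArgumentException(
          "maxTokenLength must be greater than zero, got " + length);
    } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
      throw new IllegalArgumentException(
          "maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT + ", got " + length);
    }
    return length;
  }
}
{code}
A ctor would then call {{checkMaxTokenLength(maxTokenLen)}} before sizing any
buffers.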
> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Amrit Sarkar
> Assignee: Erick Erickson
> Priority: Minor
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character
> limit for the CharTokenizer? Changing this limit requires people to
> copy/paste incrementToken into some new class, since incrementToken is
> final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and
> their factories, it would take adding a ctor to the base class in Lucene and
> using it in the factories.
> Any objections?