[
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005739#comment-16005739
]
Robert Muir commented on LUCENE-7705:
-------------------------------------
Why do the tokenizer ctor args take {{Integer}} instead of {{int}}? They check
for null and throw an exception in that case, but I think it's better to encode
the correct argument type in the method signature. This would also mean you can
re-enable TestRandomChains for the newly-added ctors.
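For illustration, a minimal sketch of the difference (a hypothetical
placeholder class, not the actual patch code):
{code}
// Hypothetical sketch only; placeholder names, not the patch code.
public final class MaxTokenLenCtors {
  private final int maxTokenLen;

  // Boxed parameter: null is representable, so it must be rejected at
  // runtime, and random test frameworks can't easily generate args for it.
  public MaxTokenLenCtors(Integer maxTokenLen) {
    if (maxTokenLen == null) {
      throw new IllegalArgumentException("maxTokenLen must not be null");
    }
    this.maxTokenLen = maxTokenLen;
  }

  // Primitive parameter: the signature itself rules out null.
  public MaxTokenLenCtors(int maxTokenLen) {
    this.maxTokenLen = maxTokenLen;
  }
}
{code}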
I think that if basic tokenizers like WhitespaceTokenizer have new ctors that
take {{int}} and throw an exception on illegal values for this parameter
(since some values aren't permitted), then these new ctors really could use
better docs than:
{quote}
@param maxTokenLen max token length the tokenizer will emit
{quote}
Instead, maybe something like:
{quote}
@param maxTokenLen maximum token length the tokenizer will emit. Must not be
negative.
@throws IllegalArgumentException if maxTokenLen is invalid.
{quote}
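In context, on a hypothetical {{int}} ctor (placeholder class, not the actual
patch), that reads:
{code}
public final class BoundedTokenizerDoc {
  private final int maxTokenLen;

  /**
   * @param maxTokenLen maximum token length the tokenizer will emit.
   *        Must not be negative.
   * @throws IllegalArgumentException if maxTokenLen is invalid.
   */
  public BoundedTokenizerDoc(int maxTokenLen) {
    if (maxTokenLen < 0) {
      throw new IllegalArgumentException(
          "maxTokenLen must not be negative, got " + maxTokenLen);
    }
    this.maxTokenLen = maxTokenLen;
  }
}
{code}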
I am also concerned about the allowable value ranges and the disabling of
tests, as far as correctness goes. If {{int}} is used instead of {{Integer}},
the test can be re-enabled, which may find/prevent crazy bugs. For example, I
worry a bit about what each of these Tokenizers will do if maxTokenLen is 0
(which is currently allowed): will they loop forever, crash, etc.?
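As a concrete probe (this assumes the patch's new
{{WhitespaceTokenizer(int)}} ctor; the loop bound is only a safety net, not a
claim about what actually happens):
{code}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MaxTokenLenZeroProbe {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new WhitespaceTokenizer(0); // assumes the new int ctor
    tok.setReader(new StringReader("abc def"));
    tok.reset();
    int emitted = 0;
    while (tok.incrementToken()) {
      // Bound the loop so a tokenizer that never advances fails fast
      // instead of hanging.
      if (++emitted > 1000) {
        throw new AssertionError(
            "still emitting after 1000 tokens with maxTokenLen=0");
      }
    }
    tok.end();
    tok.close();
  }
}
{code}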
The maximum value of the range is also suspicious. What happens with any
{{CharTokenizer}}-based tokenizer if I supply a value > IO_BUFFER_SIZE? Maybe
it behaves correctly, maybe not. I think allowing {{Integer.MAX_VALUE}} is a
trap for a number of reasons: enormous values will likely just blow up anyway
(terms are limited to 32766 bytes in the index); if they don't, they may
create hellacious threadlocal memory usage in buffers; bad/malicious input
could take out JVMs; etc.
Maybe follow the example of StandardTokenizer here:
{code}
/** Absolute maximum sized token */
public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
...
* @throws IllegalArgumentException if the given length is outside of the
* range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}].
...
if (length < 1) {
  throw new IllegalArgumentException("maxTokenLength must be greater than zero");
} else if (length > MAX_TOKEN_LENGTH_LIMIT) {
  throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
}
{code}
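For the new ctors, that could boil down to a shared guard along these lines (a
sketch only; the constant is borrowed from StandardTokenizer for illustration,
and where such a helper should live is an open question):
{code}
public final class TokenLengthGuard {
  /** Absolute maximum sized token, mirroring StandardTokenizer's limit. */
  public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

  private TokenLengthGuard() {}

  /**
   * @return the validated length, so ctors can use it inline
   * @throws IllegalArgumentException if length is outside
   *         [1, {@value #MAX_TOKEN_LENGTH_LIMIT}]
   */
  public static int checkMaxTokenLength(int length) {
    if (length < 1) {
      throw new IllegalArgumentException(
          "maxTokenLength must be greater than zero, got " + length);
    } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
      throw new IllegalArgumentException(
          "maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT + ", got " + length);
    }
    return length;
  }
}
{code}
A ctor would then call {{checkMaxTokenLength(maxTokenLen)}} before sizing any
buffers.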
> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Amrit Sarkar
> Assignee: Erick Erickson
> Priority: Minor
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character
> limit for the CharTokenizer? Changing this limit requires people to
> copy/paste incrementToken into some new class, since incrementToken is
> final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and
> their factories, it would take adding a ctor to the base class in Lucene and
> using it in the factories.
> Any objections?