[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884805#comment-15884805
 ] 

Erick Erickson commented on LUCENE-7705:
----------------------------------------

Removing the extra arguments in getMultiTermComponent() is certainly better 
than doing it in any of the superclasses; there you'd risk interfering with 
someone else's filter that _did_ coincidentally have a maxTokenLen parameter 
that should legitimately be passed through.

I guess that removing it in getMultiTermComponent() is OK. At least making the 
place that receives the maxTokenLen argument (i.e. the factory) responsible for 
removing it before passing the args on to the Filter keeps things from 
sprawling.

The other possibility is to just pass an empty map rather than munge the 
original one. LowerCaseFilterFactory's first act is to check whether the map is 
empty, after all. Something like:

return new LowerCaseFilterFactory(Collections.EMPTY_MAP); (untested).

I see no justification for passing the original args in this particular case 
anyway; I'd guess it was just convenient. Now that I think about it I like the 
EMPTY_MAP approach, but neither option is really all that superior IMO. 
EMPTY_MAP will be slightly more efficient, though I doubt it's measurable.
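The trade-off can be sketched with stand-in classes (this is not the real Lucene API; DelegateFactory, withStrippedArgs, and withEmptyMap are hypothetical names used only to illustrate the two options):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch contrasting the two options from the comment:
// (a) copy the original args and remove the maxTokenLen key before
// delegating, vs. (b) pass an empty map to the delegate factory.
public class MultiTermArgsSketch {

    // Stand-in for a delegate such as LowerCaseFilterFactory, whose
    // first act is to reject any unconsumed arguments.
    static class DelegateFactory {
        DelegateFactory(Map<String, String> args) {
            if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }
    }

    // Option (a): strip the tokenizer-specific argument, pass the rest on.
    static DelegateFactory withStrippedArgs(Map<String, String> originalArgs) {
        Map<String, String> copy = new HashMap<>(originalArgs);
        copy.remove("maxTokenLen"); // consume the tokenizer-only parameter
        return new DelegateFactory(copy);
    }

    // Option (b): the delegate needs no configuration, so hand it an empty map.
    static DelegateFactory withEmptyMap(Map<String, String> ignored) {
        return new DelegateFactory(Collections.emptyMap());
    }

    public static void main(String[] args) {
        Map<String, String> original = new HashMap<>();
        original.put("maxTokenLen", "512");

        withStrippedArgs(original); // succeeds: maxTokenLen was removed
        withEmptyMap(original);     // succeeds: no args reach the delegate

        // The caller's map is untouched either way.
        System.out.println(original.containsKey("maxTokenLen")); // prints "true"
    }
}
```

Either way the original map is never mutated; the difference is only whether any other (legitimately pass-through) arguments survive the hand-off.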

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7705
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Amrit Sarkar
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
