[
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883979#comment-15883979
]
Amrit Sarkar commented on LUCENE-7705:
--------------------------------------
Erick,
All tests now pass with the uploaded patch, after minor corrections and
rectifications in the existing test classes.
{noformat}
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java
modified:
lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java
new file:
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java
modified:
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java
modified:
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java
modified:
lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java
{noformat}
Test failure fixes:
1. org.apache.lucene.analysis.core.TestRandomChains (suite):
Added the four failing tokenizer constructors to the brokenConstructors map so
they are skipped.
This class checks which argument values are legal for each constructor and
builds maps of constructors and parameter types beforehand. It does not account
for boxing/unboxing of primitive types: when a constructor declares a parameter
as _"java.lang.Integer"_, the map-building code unboxes it to _"int"_, and the
later lookup fails because _"int.class"_ and _"java.lang.Integer.class"_ do not
match, which doesn't make sense. Either we fix how the maps are created, or we
skip these constructors for now.
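To illustrate the mismatch described above, here is a minimal sketch (the `Tok` class and its constructor are hypothetical stand-ins, not Lucene code): reflection treats `int.class` and `Integer.class` as distinct `Class` objects, so a lookup keyed on the unboxed primitive type cannot find a constructor declared with the boxed type.

```java
import java.lang.reflect.Constructor;

public class BoxingMismatch {
    // Hypothetical class whose constructor takes the boxed Integer,
    // mimicking a tokenizer constructor with a java.lang.Integer parameter.
    static class Tok {
        Tok(Integer maxTokenLen) { }
    }

    public static void main(String[] args) throws Exception {
        // The two Class objects are distinct; equals() is false.
        System.out.println(Integer.class.equals(int.class)); // false

        // Lookup succeeds only with the exact declared parameter type:
        Constructor<?> ok = Tok.class.getDeclaredConstructor(Integer.class);
        System.out.println("found: " + ok.getName());

        try {
            // Looking up with the unboxed primitive type fails.
            Tok.class.getDeclaredConstructor(int.class);
        } catch (NoSuchMethodException e) {
            System.out.println("no constructor declared for int.class");
        }
    }
}
```

This is why a pre-built map keyed on unboxed parameter types can never match a constructor declared with `java.lang.Integer`.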
2. The getMultiTermComponent method constructed a LowerCaseFilterFactory with
the original arguments, including maxTokenLen, which then threw an error:
I am not sure what corrected this, but I see no suite failing now, not even
TestFactories, which I believe was reporting the error (incompatible
constructors / no method found). Kindly verify whether we are still facing the
issue, or whether we need to harden the test cases for it.
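One plausible way to avoid that error (a hedged sketch, not necessarily what the patch does; the helper name and the "maxTokenLen" key are assumptions for illustration) is to copy the argument map and drop the entry the delegate factory does not understand before constructing it, since analysis factories reject unconsumed arguments:

```java
import java.util.HashMap;
import java.util.Map;

public class ArgsFilter {
    // Sketch: before delegating to another factory, remove arguments the
    // delegate does not accept (here the "maxTokenLen" key), so its
    // argument validation does not reject the leftover entry.
    static Map<String, String> withoutMaxTokenLen(Map<String, String> original) {
        Map<String, String> copy = new HashMap<>(original);
        copy.remove("maxTokenLen");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> original = new HashMap<>();
        original.put("maxTokenLen", "512");
        original.put("luceneMatchVersion", "7.0.0");

        Map<String, String> filtered = withoutMaxTokenLen(original);
        System.out.println(filtered.containsKey("maxTokenLen")); // false
        System.out.println(filtered.containsKey("luceneMatchVersion")); // true
    }
}
```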
> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Amrit Sarkar
> Assignee: Erick Erickson
> Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character
> limit for the CharTokenizer? In order to change this limit it requires that
> people copy/paste the incrementToken into some new class since incrementToken
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer)
> (Factories) it would take adding a c'tor to the base class in Lucene and
> using it in the factory.
> Any objections?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)