[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-06-02 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035783#comment-16035783 ]

ASF subversion and git services commented on LUCENE-7705:

Commit e4a43cf59a12ca39eb8278cc2533d409d792185a in lucene-solr's branch 
refs/heads/branch_6x from Erick
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e4a43cf ]

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to 
configure the max token len (test fix)

(cherry picked from commit 15a8a24)


> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Fix For: master (7.0), 6.7
>
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256-character 
> limit for the CharTokenizer? Changing this limit requires people to copy/paste 
> incrementToken into some new class, since incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> doing so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and 
> their factories, it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?
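
For context, the code-only route the description mentions looks roughly like the 
sketch below: KeywordTokenizer already has a ctor taking an initial buffer size, 
but using it means writing Java rather than setting a schema attribute (the size 
and input here are illustrative only):

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class KeywordBufferDemo {
  public static void main(String[] args) throws Exception {
    // Pre-existing ctor: the whole input becomes a single token; 1024 replaces
    // the 256-char default as the initial term buffer size.
    KeywordTokenizer tok =
        new KeywordTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 1024);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.setReader(new StringReader("one very long identifier"));
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString()); // the entire input, as one token
    }
    tok.end();
    tok.close();
  }
}
{code}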



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-06-02 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035782#comment-16035782 ]

ASF subversion and git services commented on LUCENE-7705:

Commit 2eacf13def4dc9fbea1de9c79150c05682b0cdec in lucene-solr's branch 
refs/heads/master from Erick
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2eacf13 ]

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to 
configure the max token length, fix test failure.





[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-06-02 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035781#comment-16035781 ]

ASF subversion and git services commented on LUCENE-7705:

Commit 15a8a2415280d50c982fcd4fca893a3c3224da14 in lucene-solr's branch 
refs/heads/master from Erick
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=15a8a24 ]

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to 
configure the max token len (test fix)





[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030223#comment-16030223 ]

Erick Erickson commented on LUCENE-7705:


OK, I see what's happening. I noted earlier that, the way this has always been 
implemented, multiple tokens are emitted when the token length is exceeded. In 
this case, the token sent in the doc is "letter", so two tokens are emitted: 
"let" and "ter", with positions incremented between them, I think.

The search is against "lett". For some reason, the parsed query in 6x is:
PhraseQuery(letter0:let t)

while in master it's:
letter0:let letter0:t

Even this is wrong; it just happens to succeed because the default operator is 
OR, so even though the tokens in the index do not include a bare "t", the doc 
is found by chance, not design.

I think the right solution is to stop emitting tokens for a particular value 
once maxTokenLen is exceeded. I'll raise a new JIRA and we can debate it there.

This is _not_ a change in behavior resulting from the changes in this JIRA; 
the tests just expose something that's always been the case but nobody's 
noticed.
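
To make the splitting concrete, here's a minimal sketch (assuming the 
two-argument (factory, maxTokenLen) ctors this issue adds; the tokenizer choice 
and input are illustrative):

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class SplitTokenDemo {
  public static void main(String[] args) throws Exception {
    // maxTokenLen=3: "letter" is split into multiple tokens, not truncated.
    LetterTokenizer tok =
        new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.setReader(new StringReader("letter"));
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString()); // prints "let", then "ter"
    }
    tok.end();
    tok.close();
  }
}
{code}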




Re: [jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Erick Erickson
Ah, OK, that one fails for me too on 6x; I was trying on trunk. Thanks... Digging.




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Steve Rowe (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030016#comment-16030016 ]

Steve Rowe commented on LUCENE-7705:


My seed reproduces for me both on Linux and on my MacBook Pro (Sierra 10.12.5, 
Oracle JDK 1.8.0_112). Note that the original failure was on branch_6x (and 
that's where I repro'd).




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029981#comment-16029981 ]

Erick Erickson commented on LUCENE-7705:


Hmmm, Steve's seed succeeds on my MacBook Pro, and I beasted this test 100 
times on my Mac Pro without failures. Not quite sure what to do next...

I wonder if SOLR-10562 is biting us again? Although that doesn't really make 
sense as the line before this failure finds the same document.

I'll beast this on a different system I have access to when I can (it's shared).




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Steve Rowe (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029872#comment-16029872 ]

Steve Rowe commented on LUCENE-7705:


Here's another reproducing failure 
[https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/3612]: 

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers 
-Dtests.seed=FE4BE1CA39C9E0DA -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=vi-VN -Dtests.timezone=Australia/NSW -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.07s J1 | 
TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
   [junit4]> Throwable #1: java.lang.RuntimeException: Exception during 
query
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
   [junit4]>at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
   [junit4]>at 
org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
   [junit4]> Caused by: java.lang.RuntimeException: REQUEST FAILED: 
xpath=//result[@numFound=1]
   [junit4]>xml response was: 
   [junit4]> 
   [junit4]> 00
   [junit4]> 
   [junit4]>request was: q=letter0:lett&wt=xml
   [junit4]>at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:888)
[...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): 
{lowerCase0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))),
 whiteSpace=Lucene50(blocksize=128), letter=BlockTreeOrds(blocksize=128), 
lowerCase=Lucene50(blocksize=128), 
unicodeWhiteSpace=BlockTreeOrds(blocksize=128), letter0=FST50, 
unicodeWhiteSpace0=FST50, 
keyword0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))),
 
id=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))),
 keyword=Lucene50(blocksize=128), 
whiteSpace0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))},
 docValues:{}, maxPointsInLeafNode=579, maxMBSortInHeap=5.430197160407458, 
sim=RandomSimilarity(queryNorm=false,coord=crazy): {}, locale=vi-VN, 
timezone=Australia/NSW
   [junit4]   2> NOTE: Linux 4.10.0-21-generic amd64/Oracle Corporation 
1.8.0_131 (64-bit)/cpus=8,threads=1,free=234332328,total=536870912
{noformat}




Re: [jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Erick Erickson
Let me take a look. Tests passed for me, probably some of the random stuff?



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-30 Thread Tomás Fernández Löbbe (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029694#comment-16029694 ]

Tomás Fernández Löbbe commented on LUCENE-7705:

TestMaxTokenLenTokenizer seems to be failing in Jenkins
{noformat}
1 tests failed.
FAILED:  
org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at 
__randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
at 
org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1713)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:907)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:943)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:957)
at 
com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:916)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:802)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:852)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: REQUEST FAILED: 
xpath=//result[@numFound=1]
xml response was: 
{noformat}


[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-28 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027974#comment-16027974 ]

Erick Erickson commented on LUCENE-7705:


Oh bother, commit numbers:
Commit to master is: 906679adc80f0fad1e5c311b03023c7bd95633d7
Commit to 6x is: 3db93a1f0980bd098af892630c6dc95c4c962e14






[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-25 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025386#comment-16025386 ]

Robert Muir commented on LUCENE-7705:

Tests such as this are not effective. If no exception is thrown the test will 
pass.

{noformat}
+try {
+  new LetterTokenizer(newAttributeFactory(), 0);
+} catch (Exception e) {
+  assertEquals("maxTokenLen must be greater than 0 and less than 1048576 
passed: 0", e.getMessage());
+}
{noformat}

I would use expectThrows in this case instead.
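
For reference, a sketch of the same check with expectThrows (the LuceneTestCase 
helper), assuming the ctor throws IllegalArgumentException with the message the 
patch asserts:

{code}
IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> {
  new LetterTokenizer(newAttributeFactory(), 0);
});
assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0",
    e.getMessage());
{code}

Unlike the try/catch form, this fails the test if no exception is thrown at all.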




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-11 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006260#comment-16006260 ]

Robert Muir commented on LUCENE-7705:

I don't think it's within the scope of this issue.




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-10 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005894#comment-16005894 ]

Erick Erickson commented on LUCENE-7705:


Fair enough, I suppose you could get log files 20 bazillion bytes long.

WDYT about generating multiple tokens when the length is exceeded?




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-10 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005892#comment-16005892 ]

Robert Muir commented on LUCENE-7705:

Sorry, I think it's a very bad idea to log tokens in Lucene's analysis chains 
for any reason.




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-10 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005888#comment-16005888 ]

Erick Erickson commented on LUCENE-7705:


One nit, one suggestion, and one question in addition to Robert's comments:

The nit:
there's a pattern of a bunch of these:

updateJ("{\"add\":{\"doc\": {\"id\":1,\"letter\":\"letter\"}},\"commit\":{}}",null);
.
.
then:
assertU(commit());

It's unnecessary to do the commit with the updateJ calls; the commit at the end 
will take care of it all. It's a little less efficient to commit with each doc. 
Frankly I doubt that'd be measurable, performance-wise, but let's take them out 
anyway, as sketched below.
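
In other words, something like this sketch (the JSON/ids mirror the shape above 
and are illustrative only):

{code}
// Index the docs with no per-doc commit...
updateJ("{\"add\":{\"doc\":{\"id\":1,\"letter\":\"letter\"}}}", null);
updateJ("{\"add\":{\"doc\":{\"id\":2,\"letter\":\"lette\"}}}", null);
// ...then a single commit at the end covers all pending adds.
assertU(commit());
{code}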

The suggestion:
When we do stop accruing characters, e.g. in CharTokenizer, let's log an 
INFO-level message to that effect, something like

log.info("Splitting token at {} chars", maxTokenLen);

That way people will have a clue where to look. I think INFO is appropriate 
rather than WARN or ERROR since it's perfectly legitimate to truncate input; 
I'm thinking OCR text, for instance. Maybe dump the token we've accumulated so 
far?

I worded it as "splitting" because (and there are tests for this) the next 
token picks up where the first left off. So if the max length is 3 and the 
input is "whitespace", we get several tokens as a result:

"whi", "tes", "pac", and "e". 

I suppose that means that the offsets are also incremented. Is that really what 
we want here? Or should we instead throw away the extra tokens? [~rcmuir], what 
do you think is correct? This is not a _change_ in behavior; the current code 
does the same thing, just with a hard-coded 255 limit. I'm checking whether this 
is intended behavior.

If we do want to throw away the extra, we could spin through the buffer until 
we encounter a non-token char, then return the part < maxTokenLen. If we did 
that we could also log the entire token and the truncated version if we wanted.

Otherwise precommit passes and all tests pass.





[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-10 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005739#comment-16005739 ]

Robert Muir commented on LUCENE-7705:

Why do the tokenizer ctors take {{Integer}} instead of {{int}}? They check for 
null and throw an exception in that case, but I think it's better to encode the 
correct argument type in the method signature. This would also mean you can 
re-enable TestRandomChains for the newly added ctors.

I think if basic tokenizers like WhitespaceTokenizer have new ctors that take 
{{int}}, and throw an exception on illegal values for this parameter because 
some aren't permitted, then these new ctors really could use a bit better docs:
{quote}
@param maxTokenLen max token length the tokenizer will emit
{quote}

Instead, maybe something like:
{quote}
@param maxTokenLen maximum token length the tokenizer will emit. Must not be 
negative.
@throws IllegalArgumentException if maxTokenLen is invalid.
{quote}

Also I am concerned about the allowable value ranges and the disabling of tests 
as far as correctness. Instead, if {{int}} is used instead of {{Integer}}, this 
test can be re-enabled, which may find/prevent crazy bugs. E.g., I worry a bit 
about what these Tokenizers will do if maxTokenLen is 0 (which is allowed): 
will they loop forever or crash, etc.?

The maximum value of the range is also suspicious. What happens with any 
{{CharTokenizer}}-based tokenizer if I supply a value > IO_BUFFER_SIZE? Maybe 
it behaves correctly, maybe not. I think {{Integer.MAX_VALUE}} is a trap for a 
number of reasons: enormous values will likely only blow up anyway (terms are 
limited to 32766 bytes in the index); if they don't, they may create hellacious 
threadlocal memory usage in buffers; bad/malicious input could take out JVMs; 
etc.

Maybe follow the example of StandardTokenizer here:

{code}
/** Absolute maximum sized token */
public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
...
   * @throws IllegalArgumentException if the given length is outside of the
   *  range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}].
...
if (length < 1) {
  throw new IllegalArgumentException("maxTokenLength must be greater than zero");
} else if (length > MAX_TOKEN_LENGTH_LIMIT) {
  throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
}
{code}
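
Applied to these tokenizers, the ctor might take the shape below. This is a 
hedged sketch in the base class, not the committed patch; the field name and 
message wording are assumptions:

{code}
/** Absolute maximum sized token, following StandardTokenizer's example. */
public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

private final int maxTokenLen;

/**
 * @param maxTokenLen maximum token length the tokenizer will emit.
 *        Must be in the range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}].
 * @throws IllegalArgumentException if maxTokenLen is invalid.
 */
public CharTokenizer(AttributeFactory factory, int maxTokenLen) {
  super(factory);
  if (maxTokenLen < 1 || maxTokenLen > MAX_TOKEN_LENGTH_LIMIT) {
    throw new IllegalArgumentException("maxTokenLen must be greater than 0 and less than "
        + MAX_TOKEN_LENGTH_LIMIT + " passed: " + maxTokenLen);
  }
  this.maxTokenLen = maxTokenLen;
}
{code}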




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-10 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005672#comment-16005672 ]

Erick Erickson commented on LUCENE-7705:


Amrit:

I'm using LUCENE-7705 (note no .patch extension). Is that the right one?






[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-10 Thread Amrit Sarkar (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004169#comment-16004169 ]

Amrit Sarkar commented on LUCENE-7705:

Erick, we got everything right this time.




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-09 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002994#comment-16002994
 ] 

Erick Erickson commented on LUCENE-7705:


Sorry 'bout that, rushing out the door this morning and didn't do the git add 
bits.

We'll get it all right yet.




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-03-31 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951168#comment-15951168 ]

Erick Erickson commented on LUCENE-7705:


Thanks!




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-03-31 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951158#comment-15951158 ]

Robert Muir commented on LUCENE-7705:

I think the LUCENE-7762 discussion is unrelated: it has to do with exposing 
such things again at the Analyzer level, vs. going the CustomAnalyzer route.




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-03-31 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951049#comment-15951049 ]

Erick Erickson commented on LUCENE-7705:


[~rcmuir][~mikemccand] Wanted to check whether the discussion at LUCENE-7762 
applies here or not. 




[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-03-13 Thread Erick Erickson (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907737#comment-15907737 ]

Erick Erickson commented on LUCENE-7705:


May or may not depend on SOLR-10229, but we want to approach them both together.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-03-06 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897762#comment-15897762
 ] 

Erick Erickson commented on LUCENE-7705:


Patch looks good. I'm going to hang back on committing this until we figure out 
SOLR-10229 (controlling schema proliferation). The additional schema you put in 
here is about the only way currently to test Solr schemas, so that's perfectly 
appropriate. I'd just like to use this as a test case for what it would take to 
move schema construction inside the tests, rather than have each new case like 
this require another schema that we then have to maintain.

But if SOLR-10229 takes too long I'll just commit this one and we can work out 
the rest later.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-26 Thread Amrit Sarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885144#comment-15885144
 ] 

Amrit Sarkar commented on LUCENE-7705:
--

I made some adjustments, including removing maxTokenLen from the 
LowerCaseFilterFactory init, and added hard test cases for the tokenizers at 
the Solr level: TestTokenizer.java 

{noformat}
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java
new file:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java
modified:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java
modified:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java
modified:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java
new file:   
solr/core/src/test-files/solr/collection1/conf/schema-tokenizer-test.xml
new file:   solr/core/src/test/org/apache/solr/util/TestTokenizer.java
{noformat}

I think we have covered everything; all tests pass. 

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-26 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884983#comment-15884983
 ] 

Erick Erickson commented on LUCENE-7705:


Ah, good catch. So yeah, we'll have to remove the maxTokenLen from the original 
args and pass the modified map through.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-26 Thread Amrit Sarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884831#comment-15884831
 ] 

Amrit Sarkar commented on LUCENE-7705:
--

Erick,

For every tokenizer init, two parameters are already included in its 
arguments, as below:
{noformat}
{class=solr.LowerCaseTokenizerFactory, luceneMatchVersion=7.0.0}
{noformat}

which is consumed by AbstractAnalysisFactory when it is instantiated:
{noformat}
    originalArgs = Collections.unmodifiableMap(new HashMap<>(args));
    System.out.println("orgs:: " + originalArgs);  // local debug output
    String version = get(args, LUCENE_MATCH_VERSION_PARAM);
    if (version == null) {
      luceneMatchVersion = Version.LATEST;
    } else {
      try {
        luceneMatchVersion = Version.parseLeniently(version);
      } catch (ParseException pe) {
        throw new IllegalArgumentException(pe);
      }
    }
    args.remove(CLASS_NAME);  // consume the class arg
  }
{noformat}
The _class_ parameter is unused, so we don't have to worry about it, while it 
does look up _luceneMatchVersion_, which is a kind of sanity check on the 
version; I'm not sure anything important takes place in the 
Version::parseLeniently(version) call. If we can confirm that, we can pass an 
empty map there.
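
For reference, a minimal illustration of what that lookup tolerates (my own 
sketch, not from the patch):

{noformat}
import java.text.ParseException;
import org.apache.lucene.util.Version;

public class VersionSketch {
  public static void main(String[] args) throws ParseException {
    // parseLeniently accepts both dotted and constant-style version strings;
    // an unparseable string raises ParseException.
    System.out.println(Version.parseLeniently("6.7.0"));
    System.out.println(Version.parseLeniently("LUCENE_6_7_0"));
  }
}
{noformat}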

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-26 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884805#comment-15884805
 ] 

Erick Erickson commented on LUCENE-7705:


Removing the extra arguments in getMultiTermComponent() is certainly better 
than doing it in any of the superclasses, where you'd have the possibility of 
interfering with someone else's filter that _did_ coincidentally have a 
maxTokenLen parameter that should legitimately be passed through.

I guess that removing it in getMultiTermComponent() is OK. At least having the 
place that receives the maxTokenLength argument (i.e. the factory) be 
responsible for removing it before passing the args on to the Filter keeps 
things from sprawling.

The other possibility is to just pass an empty map rather than munge the 
original one; LowerCaseFilterFactory's first act is to check whether the map 
is empty, after all. Something like:

return new LowerCaseFilterFactory(Collections.EMPTY_MAP); (untested).

I see no justification for passing the original args in this particular case 
anyway; I'd guess it was just convenient. I think I like the EMPTY_MAP now 
that I think about it, but neither option is really all that superior IMO. The 
EMPTY_MAP will be slightly more efficient, but I doubt it's really measurable.
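
A minimal sketch of that idea as it might sit in LowerCaseTokenizerFactory (my 
own illustration, using an explicit empty HashMap for guaranteed mutability; 
not the committed code):

{noformat}
import java.util.HashMap;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.util.AbstractAnalysisFactory;

// Hypothetical override inside LowerCaseTokenizerFactory: hand the filter
// factory a fresh, empty map so the tokenizer-only maxTokenLen argument
// never reaches it.
@Override
public AbstractAnalysisFactory getMultiTermComponent() {
  return new LowerCaseFilterFactory(new HashMap<>());
}
{noformat}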

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-26 Thread Amrit Sarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884739#comment-15884739
 ] 

Amrit Sarkar commented on LUCENE-7705:
--

Erick,

I wrote the test cases and it is a problem, but removing "maxTokenLen" from 
the original arguments that initialize LowerCaseFilterFactory makes sense, and 
it is not a hack. We have to remove the argument before the FilterFactory init 
somewhere, and it is better to do it where we make the call. I am not inclined 
to remove it in the FilterFactory init or in the AbstractAnalysisFactory call. 
So we are left with two options: either we don't provide the maxTokenLen 
option for LowerCaseTokenizer, or we remove the extra argument as you have 
done in getMultiTermComponent().

Let me know your thoughts.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-25 Thread Amrit Sarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884209#comment-15884209
 ] 

Amrit Sarkar commented on LUCENE-7705:
--

Roger that! I will incorporate test-cases for the stated tokenizers and try to 
fix the issue without removing _maxTokenLength_ from the method.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-24 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884014#comment-15884014
 ] 

Erick Erickson commented on LUCENE-7705:


Actually, I didn't encounter the error in LowerCaseFilterFactory until I tried 
it out from a fully-compiled Solr instance with maxTokenLen in the 
managed_schema file. I was thinking that it might make sense to add 
maxTokenLen to a couple of the schemas used by some of the test cases, leaving 
it at the default value of 256, just to get some test coverage. I think this 
is really the difference between a test case at the Lucene level and one based 
on the schema at the Solr level.
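
For what it's worth, a minimal sketch of such a Solr-level test, assuming the 
schema-tokenizer-test.xml file added elsewhere in this thread and a stock test 
solrconfig.xml (both file names are assumptions here):

{noformat}
import org.apache.solr.SolrTestCaseJ4;
import org.junit.BeforeClass;
import org.junit.Test;

public class TokenizerSchemaSketch extends SolrTestCaseJ4 {
  @BeforeClass
  public static void beforeTests() throws Exception {
    // Load a test schema whose tokenizers declare maxTokenLen.
    initCore("solrconfig.xml", "schema-tokenizer-test.xml");
  }

  @Test
  public void coreLoadsWithMaxTokenLen() {
    // If the factories rejected maxTokenLen, initCore would already have failed.
    assertNotNull(h.getCore());
  }
}
{noformat}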

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-24 Thread Amrit Sarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883979#comment-15883979
 ] 

Amrit Sarkar commented on LUCENE-7705:
--

Erick,

Successfully able to pass all the tests in the current patch uploaded with 
minor corrections and rectifications in exiting test-classes.

{noformat}
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java
modified:   
lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java
new file:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java
modified:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java
modified:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java
modified:   
lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java
{noformat}

Test failure fixes:

1. org.apache.lucene.analysis.core.TestRandomChains (suite):

   Added the four failing tokenizer constructors to the brokenConstructors map 
to bypass them without delay. This class checks which arguments are legal for 
the constructors and builds certain maps beforehand to check against later. It 
doesn't take boxing/unboxing of primitive data types into account: when a 
constructor takes its parameter as _"java.lang.Integer"_, the map-building 
unboxes it into _"int"_, and the later check fails because _"int.class"_ and 
_"java.lang.Integer.class"_ don't match, which doesn't make sense. Either we 
fix how the maps are created or we skip these constructors for now (a small 
illustration of the mismatch follows this list).

2. The getMultiTermComponent method constructed a LowerCaseFilterFactory with 
the original arguments, including maxTokenLen, which then threw an error:

   Not sure what corrected that, but I see no suite failing, not even 
TestFactories, which I suppose was throwing the error for incompatible 
constructors / no-method-found failures etc. Kindly verify whether we are 
still facing the issue or need to harden the test cases for it.
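
Regarding point 1, a minimal, self-contained illustration of the class-token 
mismatch (hypothetical class, not from the patch):

{noformat}
public class BoxingSketch {
  // Constructor declared with the wrapper type, as in the new tokenizer ctors.
  public BoxingSketch(Integer maxTokenLen) {}

  public static void main(String[] args) {
    Class<?> declared = BoxingSketch.class.getConstructors()[0].getParameterTypes()[0];
    // Reflection reports the wrapper, so a lookup keyed on int.class misses it:
    // the two Class tokens are distinct objects, even though autoboxing
    // converts the values freely.
    System.out.println(declared);                   // class java.lang.Integer
    System.out.println(declared == int.class);      // false
    System.out.println(int.class == Integer.TYPE);  // true; Integer.TYPE is int.class
  }
}
{noformat}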


> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-23 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881610#comment-15881610
 ] 

Erick Erickson commented on LUCENE-7705:


These two tests fail:
org.apache.lucene.analysis.core.TestUnicodeWhitespaceTokenizer.testParamsFactory
org.apache.lucene.analysis.core.TestRandomChains (suite)


The TestUnicodeWhitespaceTokenizer failure is because I added "to" to one of 
the exception messages; a trivial fix.

No idea what's happening with TestRandomChains, though.

And my nocommit ensures that 'ant precommit' will fail too; that's rather the 
point.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Attachments: LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-23 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881459#comment-15881459
 ] 

Erick Erickson commented on LUCENE-7705:


I think the patch I uploaded is the result of applying your most recent patch 
for SOLR-10186, but can you verify? We should probably consolidate the two; I 
suggest we close the Solr one as a duplicate and continue iterating here.

Erick

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Priority: Minor
> Attachments: LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-23 Thread Amrit Sarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881239#comment-15881239
 ] 

Amrit Sarkar commented on LUCENE-7705:
--

I have cooked up a patch in SOLR-10186 and introduced a new constructor in 
CharTokenizer and the related Tokenizer factories, which takes _maxCharLen_ 
and _factory_ as additional parameters.

Kindly provide your feedback and any comments on introducing new constructors 
in these classes. Thanks.
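
A hedged sketch of how such a factory-level option might be exercised once the 
patch is applied (the maxTokenLen argument name follows the later discussion 
on this issue; without the patch, this map would be rejected as containing an 
unknown parameter):

{noformat}
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;

public class FactorySketch {
  public static void main(String[] args) {
    // Factory args normally come from the schema; the new constructor would
    // consume maxTokenLen instead of hard-coding the 256-char limit.
    Map<String, String> params = new HashMap<>();
    params.put("maxTokenLen", "1024");
    WhitespaceTokenizerFactory factory = new WhitespaceTokenizerFactory(params);
    Tokenizer tokenizer = factory.create();  // tokens up to 1024 chars
    System.out.println(tokenizer);
  }
}
{noformat}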

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Priority: Minor
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? In order to change this limit it requires that 
> people copy/paste the incrementToken into some new class since incrementToken 
> is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> to do so requires code rather than being able to configure it in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org