[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504577#comment-16504577
 ] 

Alan Woodward commented on LUCENE-8273:
---

I'll let RandomChains stew for another 24 hours, but I think this is ready to 
be resolved for the upcoming 7.4 release.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504576#comment-16504576
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit a4fa16896225e08b72bf64fba97a216bb6a83fbb in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a4fa168 ]

LUCENE-8273: Don't wrap ShingleFilter in conditions in testRandomChains


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504575#comment-16504575
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit dfb679a93cc9aae2620e7ce110c9515181bd17c6 in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dfb679a ]

LUCENE-8273: Don't wrap ShingleFilter in conditions in testRandomChains


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504560#comment-16504560
 ] 

Alan Woodward commented on LUCENE-8273:
---

Yeah, it's the same issue as LUCENE-4170.  I'll add it to the blacklist.


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504554#comment-16504554
 ] 

Robert Muir commented on LUCENE-8273:
-

If shinglefilter is the buggy one it should be banned from the test for sure, 
and a issue opened for its bugginess.

Its not a test coverage concern: this can't replace unit tests: it exists to 
find new buggy interactions between the analysis components, like this one here.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504540#comment-16504540
 ] 

Alan Woodward commented on LUCENE-8273:
---

Both of these failures are due to ShingleFilter not properly handling graphs.  
Without being wrapped in a condition, the ShingleFilter is mangling its input 
graph, but it's doing it in a consistent way, so the ValidatingTokenFilter is 
happy.  However, if it's randomly turned off, then occasionally the 
ValidatingTokenFilter gets the plain input graph as opposed to the mangled one, 
and so it complains because offsets are no longer consistent.

I'm not quite sure how best to fix this.  Ideally, we'd just fix ShingleFilter, 
but that's not as simple as it sounds.  Perhaps the simplest thing to do is to 
add ShingleFilter to the blacklist, and document that ConditionalTokenFilter 
won't work with broken graph inputs?



> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504369#comment-16504369
 ] 

Alan Woodward commented on LUCENE-8273:
---

A couple more failing seeds:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
07:41:21[junit4]   2> TEST FAIL: useCharFilter=true text='t 
\u0af5\u0a9f\u0acb\u0ada\u0aa6 \u0011\u02eb^ q hnhpwei txslx  e \u22c8\u22d9 
\u2e06\u2e15\u2e6a\u2e05 uv im i \u1387\u1391\u1398\u1386\u138c  j'
07:41:21[junit4]   2> Exception from random analyzer: 
07:41:21[junit4]   2> charfilters=
07:41:21[junit4]   2>   
org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@495f18ce)
07:41:21[junit4]   2> tokenizer=
07:41:21[junit4]   2>   
org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@37e10425)
07:41:21[junit4]   2> filters=ConditionalTokenFilter: 
07:41:21[junit4]   2>   
org.apache.lucene.analysis.MockGraphTokenFilter(java.util.Random@4d3a58d5, 
OneTimeWrapper@7c12a0d4 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter:
 
07:41:21[junit4]   2>   
org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@75d26c0e 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
 )
07:41:21[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings 
-Dtests.seed=A61F0C126076A16B -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=ar-TN -Dtests.timezone=US/Arizona -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
07:41:21[junit4] ERROR   0.55s J2 | 
TestRandomChains.testRandomChainsWithLargeStrings <<<
07:41:21[junit4]> Throwable #1: java.lang.IllegalStateException: last 
stage: inconsistent endOffset at pos=10: 41 vs 55; token=L寂,釟'Ƈ⨄{ঠ֍
{code}
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
15:11:36[junit4]   2> TEST FAIL: useCharFilter=true 
text='\u2d8e\u2dbb\u2daf\u2d8b\u2d97\u2dd5\u2d97\u2dcc 
\u035b\u5996\u07ca\u0003\u12e3\u6450\uf36f '
15:11:36[junit4]   2> Exception from random analyzer: 
15:11:36[junit4]   2> charfilters=
15:11:36[junit4]   2> tokenizer=
15:11:36[junit4]   2>   
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer()
15:11:36[junit4]   2> filters=
15:11:36[junit4]   2>   
org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3f23768a 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
 lbay)ConditionalTokenFilter: 
15:11:36[junit4]   2>   
org.apache.lucene.analysis.SimplePayloadFilter(OneTimeWrapper@114147eb 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null)ConditionalTokenFilter:
 
15:11:36[junit4]   2>   
org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@680fa1a5 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null,
 7)
15:11:36[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestRandomChains -Dtests.method=testRandomChains 
-Dtests.seed=49B6EB935C8F8B35 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=nl-NL -Dtests.timezone=BST -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
15:11:36[junit4] ERROR   0.05s J2 | TestRandomChains.testRandomChains <<<
15:11:36[junit4]> Throwable #1: java.lang.IllegalStateException: last 
stage: inconsistent endOffset at pos=9: 12 vs 14; token=ߊ ዣዣ
{code}
I'm looking at these now

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493678#comment-16493678
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 8577aee08f8cf348a674dfbcc7b408358380d260 in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8577aee ]

LUCENE-8273: Adjust position increments when filtering stacked tokens


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493674#comment-16493674
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 4ea9d2ea8cbb036bac6aa1e61161afc65d04a1be in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ea9d2e ]

LUCENE-8273: Adjust position increments when filtering stacked tokens


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-25 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491222#comment-16491222
 ] 

Steve Rowe commented on LUCENE-8273:


Not sure if this is the same sort of problem you already know about, 
[~romseygeek], but my Jenkins found a reproducing TestRandomChains master seed 
for a chain including ConditionalTokenFilter:

{noformat}
Checking out Revision 18ad8d137afa8e2017f4121ddced4d630b1c86a1 
(refs/remotes/origin/master)
[...]
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false 
text='\u0bd2\u0ba7\u0bdc\u0bf5\u0b96\u0ba5 qrhnhrlyeh \u6d41\u0601\u033a\ue4e1 
bvmoocaycg dtazdd fn]|b({0,5  vjsrn'
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.standard.ClassicTokenizer()
   [junit4]   2> filters=ConditionalTokenFilter: 
   [junit4]   2>   
org.apache.lucene.analysis.miscellaneous.TypeAsSynonymFilter(OneTimeWrapper@588fa3b0
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
 tzqrpnf)
   [junit4]   2>   
org.apache.lucene.analysis.no.NorwegianLightStemFilter(ValidatingTokenFilter@63bdcc8b
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false)ConditionalTokenFilter:
 
   [junit4]   2>   
org.apache.lucene.analysis.core.TypeTokenFilter(OneTimeWrapper@28da9033 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false,
 [, , ], true)
   [junit4]   2> NOTE: download the large Jenkins line-docs file by running 
'ant get-jenkins-line-docs' in the lucene directory.
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains 
-Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=735F260A3B6964C3 
-Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
-Dtests.locale=ca -Dtests.timezone=Europe/Isle_of_Man -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] ERROR   39.0s J2 | 
TestRandomChains.testRandomChainsWithLargeStrings <<<
   [junit4]> Throwable #1: java.lang.IllegalStateException: last stage: 
inconsistent startOffset at pos=6: 48 vs 53; token=tzqrpnf
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([735F260A3B6964C3:1904991B62274430]:0)
   [junit4]>at 
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:106)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:746)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:657)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:559)
   [junit4]>at 
org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:882)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: leaving temporary files on disk at: 
/var/lib/jenkins/jobs/Lucene-Solr-Nightly-master/workspace/lucene/build/analysis/common/test/J2/temp/lucene.analysis.core.TestRandomChains_735F260A3B6964C3-001
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
{dummy=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, docValues:{}, 
maxPointsInLeafNode=1103, maxMBSortInHeap=7.435649827990908, 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@5743798e),
 locale=ca, timezone=Europe/Isle_of_Man
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 
1.8.0_151 (64-bit)/cpus=16,threads=1,free=293435744,total=411566080
{noformat}

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--

[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489903#comment-16489903
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 23ab3d19ab0239bae4cc0acdbc01a8b65c637d68 in lucene-solr's branch 
refs/heads/branch_7x from [~steve_rowe]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=23ab3d1 ]

LUCENE-8273: Move test resources to where they belong


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489904#comment-16489904
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 2f38342687f1af3a092f1fbfa183a09a55490cb3 in lucene-solr's branch 
refs/heads/master from [~steve_rowe]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2f38342 ]

LUCENE-8273: Move test resources to where they belong


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-23 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486886#comment-16486886
 ] 

Alan Woodward commented on LUCENE-8273:
---

The elastic CI has found some reproducing seeds in TestRandomChains that look 
like the following:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
01:47:39[junit4]   2> Exception from random analyzer: 
01:47:39[junit4]   2> charfilters=
01:47:39[junit4]   2>   
org.apache.lucene.analysis.fa.PersianCharFilter(java.io.StringReader@36de1051)
01:47:39[junit4]   2>   
org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@31483c67,
 org.apache.lucene.analysis.fa.PersianCharFilter@51a9d324)
01:47:39[junit4]   2> tokenizer=
01:47:39[junit4]   2>   
org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@27232fb3,
 35)
01:47:39[junit4]   2> filters=ConditionalTokenFilter: 
01:47:39[junit4]   2>   
org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(OneTimeWrapper@5f621e45
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
 
org.apache.lucene.analysis.compound.hyphenation.HyphenationTree@40cdd67e)ConditionalTokenFilter:
 
01:47:39[junit4]   2>   
org.apache.lucene.analysis.in.IndicNormalizationFilter(OneTimeWrapper@2de2e47c 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter:
 
01:47:39[junit4]   2>   
org.apache.lucene.analysis.MockRandomLookaheadTokenFilter(java.util.Random@4ced13ac,
 OneTimeWrapper@7d30a80d 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)
01:47:39[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings 
-Dtests.seed=72E157E8E16C0F79 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=en-US -Dtests.timezone=America/Anguilla -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
01:47:39[junit4] FAILURE 0.57s J0 | 
TestRandomChains.testRandomChainsWithLargeStrings <<<
01:47:39[junit4]> Throwable #1: java.lang.AssertionError
01:47:39[junit4]>   at 
__randomizedtesting.SeedInfo.seed([72E157E8E16C0F79:18BAE8F9B8222F8A]:0)
01:47:39[junit4]>   at 
org.apache.lucene.analysis.LookaheadTokenFilter.peekToken(LookaheadTokenFilter.java:140)
{code}

The root cause is that LookaheadTokenFilter doesn't play well with 
ConditionalTokenFilter when we have stacked tokens:
- CTF works by presenting the underlying TokenStream to its wrapped filter as a 
series of snippets, demarcated by tokens that don't pass the {{shouldFilter()}} 
test.  When a new snippet is started (i.e. when a token that passes 
{{shouldFilter()}} appears after one that doesn't) then {{reset()}} is called 
on the delegate, and when it stops (i.e. when a token that doesn't pass 
{{shouldFilter()}} appears) then {{end()}} is called.
- This means that if we have stacked tokens, with the first not passing 
{{shouldFilter()}} and the second passing it, the wrapped filter can see a 
TokenStream that has an initial position increment of 0
- LookaheadTokenFilter has an explicit assertion that checks we don't have an 
initial posInc of 0

I think this can be fixed by having a posInc adjustment when we're delegating, 
so that the delegated snippet starts with a posInc of 1, but this is then 
adjusted downwards by the CTF before it's emitted.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483624#comment-16483624
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 0934e2a998ac43e46594e049daab751d8cae2476 in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0934e2a ]

LUCENE-8273: Don't wrap MinHashFilter in a condition

MinHashFilter needs to consume the entire tokenstream, so wrapping it in a
randomized condition makes no sense, and breaks offsets.


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483625#comment-16483625
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 24c186eff9a9b2b2c0a86fc0a828bd81ba0993e8 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=24c186e ]

LUCENE-8273: Don't wrap MinHashFilter in a condition

MinHashFilter needs to consume the entire tokenstream, so wrapping it in a
randomized condition makes no sense, and breaks offsets.


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482583#comment-16482583
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 0c0fce3e98c9a01c330329eca5153fb78c7decaf in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0c0fce3 ]

LUCENE-8273: TestRandomChains found some more end() handling problems


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482582#comment-16482582
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit a69321a4d05d30f06248d0a33a237d8978942a9f in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a69321a ]

LUCENE-8273: TestRandomChains found some more end() handling problems


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482085#comment-16482085
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit d91273ddf0b524e14ebcc2a90bbedd8d9ae319d4 in lucene-solr's branch 
refs/heads/master from [~steve_rowe]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d91273d ]

LUCENE-8273: Rename TermExclusionFilter -> ProtectedTermFilter.  Allow 
ProtectedTermFilterFactory to be used outside of CustomAnalyzer, including in 
Solr, by allowing wrapped filters and their parameters to be specified on 
construction.  Add tests for ProtectedTermFilterFactory in 
lucene/common/analysis/ and in solr/core/.  Add Solr ref guide documentation 
for ProtectedTermFilterFactory.  Improve javadocs for CustomAnalyzer, 
ConditionalTokenFilter, and ProtectedTermFilter.


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482084#comment-16482084
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 97490ac7c66313625c9526a466ab96981e249746 in lucene-solr's branch 
refs/heads/branch_7x from [~steve_rowe]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=97490ac ]

LUCENE-8273: Rename TermExclusionFilter -> ProtectedTermFilter.  Allow 
ProtectedTermFilterFactory to be used outside of CustomAnalyzer, including in 
Solr, by allowing wrapped filters and their parameters to be specified on 
construction.  Add tests for ProtectedTermFilterFactory in 
lucene/common/analysis/ and in solr/core/.  Add Solr ref guide documentation 
for ProtectedTermFilterFactory.  Improve javadocs for CustomAnalyzer, 
ConditionalTokenFilter, and ProtectedTermFilter.


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-20 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482079#comment-16482079
 ] 

Steve Rowe commented on LUCENE-8273:


Attached remaining parts from my -rebased patch.  Tests and precommit pass. 
Committing shortly.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480717#comment-16480717
 ] 

Steve Rowe commented on LUCENE-8273:


{quote}
Would you like me to make a new patch from the other stuff in my patch?
Yes please! Feel free to commit it if you think it's ready, I wasn't sure if 
you still wanted to add more testing or docs.
{quote}
OK, will do. Yeah, I think it's ready to go, but I'll re-run precommit and 
tests before I commit.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480712#comment-16480712
 ] 

Alan Woodward commented on LUCENE-8273:
---

bq. Would you like me to make a new patch from the other stuff in my patch?

Yes please!  Feel free to commit it if you think it's ready, I wasn't sure if 
you still wanted to add more testing or docs.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480698#comment-16480698
 ] 

Steve Rowe commented on LUCENE-8273:


bq. Attached is a patch that includes Steve Rowe's test improvements. 

Would you like me to make a new patch from the other stuff in my patch?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480594#comment-16480594
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 443bd2ac526a40079e9028a5e7d8ad2f24ca6a4b in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=443bd2a ]

LUCENE-8273: Fix end() and posInc handling


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480595#comment-16480595
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit b1ee23c525a64017242148bca4368fe1be3a in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b1ee23c ]

LUCENE-8273: Fix end() and posInc handling


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480574#comment-16480574
 ] 

Alan Woodward commented on LUCENE-8273:
---

Nope, that's a relic from a failed attempt to fix something, and precommit has 
just pulled me up on it not having any javadocs :) Will nuke it before I commit.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480559#comment-16480559
 ] 

Robert Muir commented on LUCENE-8273:
-

is the TokenBuffer class in the patch actually used?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480549#comment-16480549
 ] 

Alan Woodward commented on LUCENE-8273:
---

I think I have now chased down the last failures, all around end() propagation 
and how to deal with position increments when FilteredTermFilter is skipped.  
Attached is a patch that includes [~steve_rowe]'s test improvements.  I'll 
commit this now, and un-awaitsfix TestRandomChains and see if that throws out 
any new bugs in the next while.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478208#comment-16478208
 ] 

Robert Muir commented on LUCENE-8273:
-

{quote}
In {{TestConditionalFilter}}, converted most {{CannedTokenStream}}'s to 
{{MockTokenizer}}'s, which causes error {{IllegalStateException: end() called 
in wrong state=END!}} - I'm guessing you already know about this and are 
working on it
{quote}

Good approach, TestRandomChains is rather inefficient (basically an integration 
test) and its best to always make the fails reproduce with simpler unit tests.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478177#comment-16478177
 ] 

Steve Rowe commented on LUCENE-8273:


bq. I'm attaching my latest patch, which includes the failing seed. Steve Rowe, 
can you rebase on top of this?

Done: [^LUCENE-8273-part2-rebased.patch].  (I haven't run tests or precommit 
though.)

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478105#comment-16478105
 ] 

Steve Rowe commented on LUCENE-8273:


bq. Every time I think I have this fixed, TestRandomChains finds another 
failure... I'm attaching my latest patch, which includes the failing seed. 
Steve Rowe, can you rebase on top of this?

Sure, will do.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477910#comment-16477910
 ] 

Alan Woodward commented on LUCENE-8273:
---

Every time I think I have this fixed, TestRandomChains finds another failure... 
 I'm attaching my latest patch, which includes the failing seed.  
[~steve_rowe], can you rebase on top of this?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477590#comment-16477590
 ] 

Steve Rowe commented on LUCENE-8273:


Attaching an updated version of my part2 patch.  Changes:
* Added a Solr test and ref guide text for {{ProtectedTermFilterFactory}}
* Moved the failing read-ahead test ved over to {{TestConditionalFilter}}
* In {{TestConditionalFilter}}, converted most {{CannedTokenStream}}'s to 
{{MockTokenizer}}'s, which causes error {{IllegalStateException: end() called 
in wrong state=END!}} - I'm guessing you already know about this and are 
working on it
* Except for the {{end()}}-related failures, tests succeed
* Precommit succeeds


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477116#comment-16477116
 ] 

Alan Woodward commented on LUCENE-8273:
---

There's a bug in the way that tokens are buffered if the wrapped TokenFilter 
needs to read ahead.  I'm working on a fix for that now.

TestRandomChains has found quite a few problems with this, I'm tempted to back 
it out and work on a branch for a while as it's clearly not ready for release 
yet.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-14 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475166#comment-16475166
 ] 

Steve Rowe commented on LUCENE-8273:


I stumbled on what looks like a {{ProtectedTermFilter}} bug when a wrapped 
filter is a filtering token filter, and the content to be analyzed contains at 
least one non-protected term prior to a protected term; in this case protection 
fails:

{code:java|title=TestProtectedTerm.java}
  public void testWrappedFilteringTokenFilter() throws IOException {
CharArraySet protectedTerms = new CharArraySet(5, true);
protectedTerms.add("foobar");
TokenStream stream = whitespaceMockTokenizer("foobar abc");
stream = new ProtectedTermFilter(protectedTerms, stream, in -> new 
LengthFilter(in, 1, 4));
assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // 
succeeds

stream = whitespaceMockTokenizer("wuthering foobar abc");
stream = new ProtectedTermFilter(protectedTerms, stream, in -> new 
LengthFilter(in, 1, 4));
assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // 
fails @ term 0: Actual: abc
  }
{code}

I haven't yet figured out what the problem is.  Alan, do you understand what's 
happening here?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-14 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474120#comment-16474120
 ] 

Steve Rowe commented on LUCENE-8273:


Okay, I'll finish up the remaining work today.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-14 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473985#comment-16473985
 ] 

Alan Woodward commented on LUCENE-8273:
---

Thanks Steve, your patch looks great.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472873#comment-16472873
 ] 

Steve Rowe commented on LUCENE-8273:


I've belatedly reviewed the changes committed under this issue - awesome 
additions!

I noticed a few small things that could be improved, and in putting together a 
patch I found what I think is a bug/omission in the design: 
{{TermExclusionFilterFactory}} can't be used outside of {{CustomAnalyzer}} 
because {{ConditionalTokenFilterFactory.inform()}} expects 
{{setInnerFilters()}} to have already been called, but 
{{TermExclusionFilterFactory}} doesn't have the ability to do that. As a 
result, the protected word list can't be loaded. AFAICT this will prevent 
{{TermExclusionFilterFactory}} from being used in Solr, e.g.

My idea to address the problem is to allow {{TermExclusionFilterFactory}} to 
accept (but not require) as args wrapped filters and those filters' args, and 
then call {{setInnerFilters()}} from the ctor. I've added a factory test suite 
that appears to validate the idea. See javadocs in the patch for interface 
details.

Because of the way I produced it, the patch (forthcoming) includes several 
other changes, some of which may be controversial. I'd be happy to factor out 
any of these changes if you don't agree with all of them, [~romseygeek]. The 
other changes, in no particular order:
 * I think {{TermExclusionFilter}} should be renamed to {{ProtectedTermFilter}} 
(or similar) - forcing devs to deal with *both* "protected" and 
"excluded" seems pointless to me; it should be either one or the other 
everywhere.
 * {{ProtectedTermFilter}} ctor should {{require()}} the protected terms file 
param rather than {{get()}} it, since it is in fact required.
 * Added {{TestConditionalTokenFilter.testMultipleConditionalFilters()}}
 * The boolean sense of {{ConditionalTokenFilter.shouldFilter()}} is backward 
in the class javadocs: it says that shouldFilter:false will execute the wrapped 
filters, but AFAICT it's the opposite.
 * Some other minor javadoc improvements.

If the approach is okay, there is a little bit more work to do: some additional 
testing (more tests for graceful handling of bad input, and a Solr test), and 
adding Solr ref guide doc.  I haven't run precommit or all tests yet, so there 
may be some outstanding problems.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472733#comment-16472733
 ] 

Robert Muir commented on LUCENE-8273:
-

maybe ConditionalTokenFilter's toString() can be improved which will help in 
debugging chains like that.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472709#comment-16472709
 ] 

Alan Woodward commented on LUCENE-8273:
---

Some more failures in testRandomChains:
https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/1885/
https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/1857/

I'm going to AwaitsFix it for the weekend, and look at it again on Monday.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472594#comment-16472594
 ] 

Michael McCandless commented on LUCENE-8273:


+1, very cool addition to Lucene!

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472342#comment-16472342
 ] 

Mike Sokolov commented on LUCENE-8273:
--

Yes, thanks [~romseygeek] this is a great addition

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472235#comment-16472235
 ] 

Robert Muir commented on LUCENE-8273:
-

I opened LUCENE-8308, I think we have to work our way through that one first. 
Thanks for the work here, great improvement.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472206#comment-16472206
 ] 

Alan Woodward commented on LUCENE-8273:
---

bq. Thanks for debugging the failure

There will be more, I'm sure!

bq. Should we open a followup issue to clean up the analyzers and stuff?

+1

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472184#comment-16472184
 ] 

Robert Muir commented on LUCENE-8273:
-

Thanks for debugging the failure: TestRandomChains is cruel but it works. 
Should we open a followup issue to clean up the analyzers and stuff? This is 
something i can help with.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471991#comment-16471991
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit aea073e029f0aa3dd32acb41ba0209c12da9725c in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aea073e ]

LUCENE-8273: Fix end() propagation


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471992#comment-16471992
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 2225d0e464eeeabcec682edcabd51b136ac1aee2 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2225d0e ]

LUCENE-8273: Fix end() propagation


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471900#comment-16471900
 ] 

Alan Woodward commented on LUCENE-8273:
---

The elasticsearch CI had some failures due to this:
{code}
ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains 
-Dtests.seed=EF8BCF910EB1138C -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=es-CR -Dtests.timezone=Asia/Ashgabat -Dtests.asserts=true 
-Dtests.file.encoding=UTF8
{code}

They look to be caused by FingerprintFilter being wrapped in a 
ConditionalTokenStream, which I don't think makes any sense?  So the simplest 
solution is probably to blacklist it.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471732#comment-16471732
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 1ce3ebadbd60d4485145c71b48330d0aabde77ad in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ce3eba ]

LUCENE-8273: Add ConditionalTokenFilter


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471731#comment-16471731
 ] 

ASF subversion and git services commented on LUCENE-8273:
-

Commit 7f13e5642924aa7f97e21b059553cfd6b5b56c21 in lucene-solr's branch 
refs/heads/branch_7x from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7f13e56 ]

LUCENE-8273: Add ConditionalTokenFilter


> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-09 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468547#comment-16468547
 ] 

Alan Woodward commented on LUCENE-8273:
---

New patch, {{whenTerm()}} now takes {{Predicate}}, and 
{{TermExclusionFilterFactory}} uses {{protected}} instead of {{excludeFile}}

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467407#comment-16467407
 ] 

Robert Muir commented on LUCENE-8273:
-

For the new TermExclusionFilterFactory:

{code}
public static final String EXCLUDED_TOKENS = "excludeFile";
{code}

KeywordMarkerFilterFactory currently uses a different parameter name: 
"protected". So does WordDelimiterFilterFactory (which I think was the one that 
inspired this JIRA issue). Maybe we want it to be consistent, to support 
migrating away from that stuff?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-08 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467353#comment-16467353
 ] 

Alan Woodward commented on LUCENE-8273:
---

Thanks David, I'll do that.  Will commit later on if nobody else has any 
comments.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-07 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466309#comment-16466309
 ] 

David Smiley commented on LUCENE-8273:
--

Looks good Alan.
 * Maybe make {{ConditionBuilder whenTerm(Predicate 
predicate)}} even simpler by referring to CharSequence in the predicate instead 
of CharTermAttribute? This is just a convenience method after all; and it keeps 
the caller immune to some lower level details like Attributes. And it somewhat 
prevents them from using mutable methods on CharTermAttribute which would be 
pretty nasty.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-07 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466012#comment-16466012
 ] 

Alan Woodward commented on LUCENE-8273:
---

Patch, up to date with master and passing all tests and precommit.  I think 
this is ready?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-01 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459643#comment-16459643
 ] 

Alan Woodward commented on LUCENE-8273:
---

bq. it seems you need to make the ConditionalTokenFilterFactory implement the 
resourceloaderaware stuff always

Just spotted Robert's comment here, I've added a new patch which fixes this.  
Thanks!

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-01 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459564#comment-16459564
 ] 

Alan Woodward commented on LUCENE-8273:
---

Updated patch.  {{ConditionalTokenFilterFactory}} is now a top-level class, 
distinct from {{ConditionBuilder}}.  I've added a TermExclusionFilter that 
accepts a list of terms and only runs its child filters if the current token is 
not in its list, and demonstrated how to use it in TestCustomAnalyzer.  At the 
moment it just reads a word file, but we can expand it to accept patterns or a 
directly passed in list of terms in follow ups.  I've also changed the 
CustomAnalyzerBuilder to use {{when}} rather than {{ifXXX}} - thanks for the 
suggestion Steve!

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459103#comment-16459103
 ] 

Steve Rowe commented on LUCENE-8273:


bq. {{if}} is a keyword, unfortunately, so that won't work. Maybe 
{{ifCondition}} instead?

Shorter and almost as good?: {{when}} (used e.g. by Mockito)

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458569#comment-16458569
 ] 

Robert Muir commented on LUCENE-8273:
-

sounds good. yeah i know the resource stuff/keywork marking is tricky, i looked 
at what the existing factory is doing and its pretty crazy. 

it seems you need to make the ConditionalTokenFilterFactory implement the 
resourceloaderaware stuff always, because its separately a bug that the current 
patch will "hide" the resourceloader from anything inside the if? So I think it 
should implement the interface and pass the loader in its inform() method to 
stuff inside. Maybe this leads towards a solution to what you need for the 
conditional part, too.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458550#comment-16458550
 ] 

Alan Woodward commented on LUCENE-8273:
---

{{if}} is a keyword, unfortunately, so that won't work.  Maybe {{ifCondition}} 
instead?

Access to resources will be a bit trickier, I think maybe the best way to do 
that would be to have a specialised method {{ifInList}} or something similar 
that takes a path to a word list.  I'll see what I can come up with.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458507#comment-16458507
 ] 

Robert Muir commented on LUCENE-8273:
-

I'm also curious about common use cases where the condition is just matching a 
list of words, basically what the KeywordMarker factory provides today. Does 
the user have access to stuff like resource loaders to read from files / is it 
intuitive so they won't be reading the list of words in on every token or other 
mistakes? If we can make this simple, I think we can deprecate KeywordMarker 
and many other exception-list-type mechanisms hardcoded in all the filters, 
which would be a really nice cleanup. 

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458501#comment-16458501
 ] 

Robert Muir commented on LUCENE-8273:
-

I like the custom analyzer integration. Can we rename {{ifMatches}} just to 
{{if}}? This makes it more natural for the user to pair with the necessary 
{{endif}}, doesn't imply any regex matching, etc. I think its useful to mention 
the necessary endif in the javadocs for both these methods, if users forget to 
call it they should get a compile error from build(), but it may not be 
obvious. I would also add to the {{ifTerm}} docs that it is just sugar, with 
snippet of how to implement it with if + CharTermAttribute. This gives an 
example in case the user needs to work on some other attribute.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458472#comment-16458472
 ] 

Alan Woodward commented on LUCENE-8273:
---

[~msoko...@gmail.com] in that case, the filter would still get applied, because 
we only check the token that is passed to the synonym filter from outside.  
Anything that's pulled by the synonym filter itself doesn't get checked.  
Although thinking about it, it would be possible to run the check in the 
OneTimeWrapper as well and return 'false' from things that don't pass the 
check.  I'm not sure how that would work with the graph itself though, it might 
end up corrupting things.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458469#comment-16458469
 ] 

Alan Woodward commented on LUCENE-8273:
---

And here's a patch that integrates it into CustomAnalyzer

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-27 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457066#comment-16457066
 ] 

Mike Sokolov commented on LUCENE-8273:
--

Cool! I haven't read the patch carefully, but I looked at the tests, and I
see RandomChains test so this is probably covered, but I was wondering
about the case you mentioned earlier where say there is a multi-token
synonym and shouldFilter() is true for the first token and false for the
second. I guess the filter just does not apply the synonym then?

On Fri, Apr 27, 2018 at 12:44 PM, Alan Woodward (JIRA) 



> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-27 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456717#comment-16456717
 ] 

Alan Woodward commented on LUCENE-8273:
---

I think I fixed it - attached is a patch including a test where we wrap 
SynonymGraphFilter, and everything seems to pass.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455543#comment-16455543
 ] 

Robert Muir commented on LUCENE-8273:
-

just imaging scenarios more, its probably useful if the thing can avoid 
corrupting graphs. By that i mean: conceptually the user has to understand that 
the filtering applies the condition based on the first token and that the 
filter gets whatever it pulls (based on its wanted context), and those are 
provided "graph-aligned" or something. I think its just inherent in what you 
are trying to do and not specific to the implementation: it needs to have some 
restrictions to avoid trouble? So maybe this filter should also consider 
positionLength...

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453852#comment-16453852
 ] 

Robert Muir commented on LUCENE-8273:
-

I feel like the OneTimeWrapper you have is close, it just needs to set the 
boolean on its "parent" after input.incrementToken() ? At that point we know 
the subfilter is "done"

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453818#comment-16453818
 ] 

Robert Muir commented on LUCENE-8273:
-

i mean in the worst case, you could make it "correct" by using something like 
the StackWalker api right? Then we figure out how to make it more efficient.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453814#comment-16453814
 ] 

Robert Muir commented on LUCENE-8273:
-

ok i get it, so basically the filter switches between two states right now with 
that boolean it has. So its basically hardcoding that the passed filter will 
only call incrementToken once with that. 

maybe it could do this differently, like insert an "end-if" filter around the 
one passed in by the user, which would signal completion?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453807#comment-16453807
 ] 

Alan Woodward commented on LUCENE-8273:
---

Yes, exactly that.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453806#comment-16453806
 ] 

Robert Muir commented on LUCENE-8273:
-

So you mean that the problem is not related to buffering state, its more that 
it only expects one call to incrementToken() and does the wrong thing if it 
gets called additional times? Sorry I'm slow here, i wanted to play with the 
patch last night but didn't have the time.

 

 

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-25 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452676#comment-16452676
 ] 

Alan Woodward commented on LUCENE-8273:
---

bq. You mean any filter that uses captureState?

capture/restoreState works fine, the problem comes when you get a filter that 
needs to look ahead in the tokenstream, so for example if SynonymGraphFilter 
has a multiword synonym "a b c -> d", and you hit token "a", then the filter 
pulls in two more tokens to see if it matches the whole synonym; but 
ConditionalTokenFilter only allows you to pull in one token at a time, because 
it needs to distinguish between incrementToken() as called by the next filter 
down the line, and incrementToken() as called by its delegate.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-25 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452431#comment-16452431
 ] 

Mike Sokolov commented on LUCENE-8273:
--

Hmm right. Well you might be able to use that idea to detect illegal states at 
least

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452428#comment-16452428
 ] 

Robert Muir commented on LUCENE-8273:
-

{quote}
This latter one is a bit clunky, as this TokenFilter won't work with filters 
that consume more than one token at a time - eg ShingleFilter or 
SynonymGraphFilter. At the moment I have a blacklist, but there may be a better 
way of isolating that - preferably one that throws errors when you build the 
TokenStream. Speak up if you have any suggestions.
{quote}

I don't understand the problem yet, but I will try to fight with it. You mean 
any filter that uses captureState?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org