[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504577#comment-16504577 ] Alan Woodward commented on LUCENE-8273: --- I'll let RandomChains stew for another 24 hours, but I think this is ready to be resolved for the upcoming 7.4 release. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504576#comment-16504576 ] ASF subversion and git services commented on LUCENE-8273: - Commit a4fa16896225e08b72bf64fba97a216bb6a83fbb in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a4fa168 ] LUCENE-8273: Don't wrap ShingleFilter in conditions in testRandomChains > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504575#comment-16504575 ] ASF subversion and git services commented on LUCENE-8273: - Commit dfb679a93cc9aae2620e7ce110c9515181bd17c6 in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dfb679a ] LUCENE-8273: Don't wrap ShingleFilter in conditions in testRandomChains > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504560#comment-16504560 ] Alan Woodward commented on LUCENE-8273: --- Yeah, it's the same issue as LUCENE-4170. I'll add it to the blacklist. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504554#comment-16504554 ] Robert Muir commented on LUCENE-8273: - If shinglefilter is the buggy one it should be banned from the test for sure, and a issue opened for its bugginess. Its not a test coverage concern: this can't replace unit tests: it exists to find new buggy interactions between the analysis components, like this one here. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504540#comment-16504540 ] Alan Woodward commented on LUCENE-8273: --- Both of these failures are due to ShingleFilter not properly handling graphs. Without being wrapped in a condition, the ShingleFilter is mangling its input graph, but it's doing it in a consistent way, so the ValidatingTokenFilter is happy. However, if it's randomly turned off, then occasionally the ValidatingTokenFilter gets the plain input graph as opposed to the mangled one, and so it complains because offsets are no longer consistent. I'm not quite sure how best to fix this. Ideally, we'd just fix ShingleFilter, but that's not as simple as it sounds. Perhaps the simplest thing to do is to add ShingleFilter to the blacklist, and document that ConditionalTokenFilter won't work with broken graph inputs? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504369#comment-16504369 ] Alan Woodward commented on LUCENE-8273: --- A couple more failing seeds: {code} Suite: org.apache.lucene.analysis.core.TestRandomChains 07:41:21[junit4] 2> TEST FAIL: useCharFilter=true text='t \u0af5\u0a9f\u0acb\u0ada\u0aa6 \u0011\u02eb^ q hnhpwei txslx e \u22c8\u22d9 \u2e06\u2e15\u2e6a\u2e05 uv im i \u1387\u1391\u1398\u1386\u138c j' 07:41:21[junit4] 2> Exception from random analyzer: 07:41:21[junit4] 2> charfilters= 07:41:21[junit4] 2> org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@495f18ce) 07:41:21[junit4] 2> tokenizer= 07:41:21[junit4] 2> org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@37e10425) 07:41:21[junit4] 2> filters=ConditionalTokenFilter: 07:41:21[junit4] 2> org.apache.lucene.analysis.MockGraphTokenFilter(java.util.Random@4d3a58d5, OneTimeWrapper@7c12a0d4 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter: 07:41:21[junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@75d26c0e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, ) 07:41:21[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=A61F0C126076A16B -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ar-TN -Dtests.timezone=US/Arizona -Dtests.asserts=true -Dtests.file.encoding=UTF-8 07:41:21[junit4] ERROR 0.55s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<< 07:41:21[junit4]> Throwable #1: java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=10: 41 vs 55; token=L寂,釟'Ƈ⨄{ঠ֍ {code} {code} Suite: org.apache.lucene.analysis.core.TestRandomChains 15:11:36[junit4] 2> TEST FAIL: useCharFilter=true text='\u2d8e\u2dbb\u2daf\u2d8b\u2d97\u2dd5\u2d97\u2dcc \u035b\u5996\u07ca\u0003\u12e3\u6450\uf36f ' 15:11:36[junit4] 2> Exception from random analyzer: 15:11:36[junit4] 2> charfilters= 15:11:36[junit4] 2> tokenizer= 15:11:36[junit4] 2> org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer() 15:11:36[junit4] 2> filters= 15:11:36[junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3f23768a term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, lbay)ConditionalTokenFilter: 15:11:36[junit4] 2> org.apache.lucene.analysis.SimplePayloadFilter(OneTimeWrapper@114147eb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null)ConditionalTokenFilter: 15:11:36[junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(OneTimeWrapper@680fa1a5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,payload=null, 7) 15:11:36[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=49B6EB935C8F8B35 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=nl-NL -Dtests.timezone=BST -Dtests.asserts=true -Dtests.file.encoding=US-ASCII 15:11:36[junit4] ERROR 0.05s J2 | TestRandomChains.testRandomChains <<< 15:11:36[junit4]> Throwable #1: java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=9: 12 vs 14; token=ߊ ዣዣ {code} I'm looking at these now > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493678#comment-16493678 ] ASF subversion and git services commented on LUCENE-8273: - Commit 8577aee08f8cf348a674dfbcc7b408358380d260 in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8577aee ] LUCENE-8273: Adjust position increments when filtering stacked tokens > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493674#comment-16493674 ] ASF subversion and git services commented on LUCENE-8273: - Commit 4ea9d2ea8cbb036bac6aa1e61161afc65d04a1be in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ea9d2e ] LUCENE-8273: Adjust position increments when filtering stacked tokens > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491222#comment-16491222 ] Steve Rowe commented on LUCENE-8273: Not sure if this is the same sort of problem you already know about, [~romseygeek], but my Jenkins found a reproducing TestRandomChains master seed for a chain including ConditionalTokenFilter: {noformat} Checking out Revision 18ad8d137afa8e2017f4121ddced4d630b1c86a1 (refs/remotes/origin/master) [...] [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains [junit4] 2> TEST FAIL: useCharFilter=false text='\u0bd2\u0ba7\u0bdc\u0bf5\u0b96\u0ba5 qrhnhrlyeh \u6d41\u0601\u033a\ue4e1 bvmoocaycg dtazdd fn]|b({0,5 vjsrn' [junit4] 2> Exception from random analyzer: [junit4] 2> charfilters= [junit4] 2> tokenizer= [junit4] 2> org.apache.lucene.analysis.standard.ClassicTokenizer() [junit4] 2> filters=ConditionalTokenFilter: [junit4] 2> org.apache.lucene.analysis.miscellaneous.TypeAsSynonymFilter(OneTimeWrapper@588fa3b0 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, tzqrpnf) [junit4] 2> org.apache.lucene.analysis.no.NorwegianLightStemFilter(ValidatingTokenFilter@63bdcc8b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false)ConditionalTokenFilter: [junit4] 2> org.apache.lucene.analysis.core.TypeTokenFilter(OneTimeWrapper@28da9033 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false, [, , ], true) [junit4] 2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory. [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=735F260A3B6964C3 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=ca -Dtests.timezone=Europe/Isle_of_Man -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 39.0s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<< [junit4]> Throwable #1: java.lang.IllegalStateException: last stage: inconsistent startOffset at pos=6: 48 vs 53; token=tzqrpnf [junit4]>at __randomizedtesting.SeedInfo.seed([735F260A3B6964C3:1904991B62274430]:0) [junit4]>at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:106) [junit4]>at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:746) [junit4]>at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:657) [junit4]>at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:559) [junit4]>at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:882) [junit4]>at java.lang.Thread.run(Thread.java:748) [junit4] 2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/jobs/Lucene-Solr-Nightly-master/workspace/lucene/build/analysis/common/test/J2/temp/lucene.analysis.core.TestRandomChains_735F260A3B6964C3-001 [junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, docValues:{}, maxPointsInLeafNode=1103, maxMBSortInHeap=7.435649827990908, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@5743798e), locale=ca, timezone=Europe/Isle_of_Man [junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_151 (64-bit)/cpus=16,threads=1,free=293435744,total=411566080 {noformat} > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. --
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489903#comment-16489903 ] ASF subversion and git services commented on LUCENE-8273: - Commit 23ab3d19ab0239bae4cc0acdbc01a8b65c637d68 in lucene-solr's branch refs/heads/branch_7x from [~steve_rowe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=23ab3d1 ] LUCENE-8273: Move test resources to where they belong > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489904#comment-16489904 ] ASF subversion and git services commented on LUCENE-8273: - Commit 2f38342687f1af3a092f1fbfa183a09a55490cb3 in lucene-solr's branch refs/heads/master from [~steve_rowe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2f38342 ] LUCENE-8273: Move test resources to where they belong > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486886#comment-16486886 ] Alan Woodward commented on LUCENE-8273: --- The elastic CI has found some reproducing seeds in TestRandomChains that look like the following: {code} Suite: org.apache.lucene.analysis.core.TestRandomChains 01:47:39[junit4] 2> Exception from random analyzer: 01:47:39[junit4] 2> charfilters= 01:47:39[junit4] 2> org.apache.lucene.analysis.fa.PersianCharFilter(java.io.StringReader@36de1051) 01:47:39[junit4] 2> org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@31483c67, org.apache.lucene.analysis.fa.PersianCharFilter@51a9d324) 01:47:39[junit4] 2> tokenizer= 01:47:39[junit4] 2> org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@27232fb3, 35) 01:47:39[junit4] 2> filters=ConditionalTokenFilter: 01:47:39[junit4] 2> org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(OneTimeWrapper@5f621e45 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, org.apache.lucene.analysis.compound.hyphenation.HyphenationTree@40cdd67e)ConditionalTokenFilter: 01:47:39[junit4] 2> org.apache.lucene.analysis.in.IndicNormalizationFilter(OneTimeWrapper@2de2e47c term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter: 01:47:39[junit4] 2> org.apache.lucene.analysis.MockRandomLookaheadTokenFilter(java.util.Random@4ced13ac, OneTimeWrapper@7d30a80d term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1) 01:47:39[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=72E157E8E16C0F79 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US -Dtests.timezone=America/Anguilla -Dtests.asserts=true -Dtests.file.encoding=US-ASCII 01:47:39[junit4] FAILURE 0.57s J0 | TestRandomChains.testRandomChainsWithLargeStrings <<< 01:47:39[junit4]> Throwable #1: java.lang.AssertionError 01:47:39[junit4]> at __randomizedtesting.SeedInfo.seed([72E157E8E16C0F79:18BAE8F9B8222F8A]:0) 01:47:39[junit4]> at org.apache.lucene.analysis.LookaheadTokenFilter.peekToken(LookaheadTokenFilter.java:140) {code} The root cause is that LookaheadTokenFilter doesn't play well with ConditionalTokenFilter when we have stacked tokens: - CTF works by presenting the underlying TokenStream to its wrapped filter as a series of snippets, demarcated by tokens that don't pass the {{shouldFilter()}} test. When a new snippet is started (i.e. when a token that passes {{shouldFilter()}} appears after one that doesn't) then {{reset()}} is called on the delegate, and when it stops (i.e. when a token that doesn't pass {{shouldFilter()}} appears) then {{end()}} is called. - This means that if we have stacked tokens, with the first not passing {{shouldFilter()}} and the second passing it, the wrapped filter can see a TokenStream that has an initial position increment of 0 - LookaheadTokenFilter has an explicit assertion that checks we don't have an initial posInc of 0 I think this can be fixed by having a posInc adjustment when we're delegating, so that the delegated snippet starts with a posInc of 1, but this is then adjusted downwards by the CTF before it's emitted. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483624#comment-16483624 ] ASF subversion and git services commented on LUCENE-8273: - Commit 0934e2a998ac43e46594e049daab751d8cae2476 in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0934e2a ] LUCENE-8273: Don't wrap MinHashFilter in a condition MinHashFilter needs to consume the entire tokenstream, so wrapping it in a randomized condition makes no sense, and breaks offsets. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483625#comment-16483625 ] ASF subversion and git services commented on LUCENE-8273: - Commit 24c186eff9a9b2b2c0a86fc0a828bd81ba0993e8 in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=24c186e ] LUCENE-8273: Don't wrap MinHashFilter in a condition MinHashFilter needs to consume the entire tokenstream, so wrapping it in a randomized condition makes no sense, and breaks offsets. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482583#comment-16482583 ] ASF subversion and git services commented on LUCENE-8273: - Commit 0c0fce3e98c9a01c330329eca5153fb78c7decaf in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0c0fce3 ] LUCENE-8273: TestRandomChains found some more end() handling problems > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482582#comment-16482582 ] ASF subversion and git services commented on LUCENE-8273: - Commit a69321a4d05d30f06248d0a33a237d8978942a9f in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a69321a ] LUCENE-8273: TestRandomChains found some more end() handling problems > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482085#comment-16482085 ] ASF subversion and git services commented on LUCENE-8273: - Commit d91273ddf0b524e14ebcc2a90bbedd8d9ae319d4 in lucene-solr's branch refs/heads/master from [~steve_rowe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d91273d ] LUCENE-8273: Rename TermExclusionFilter -> ProtectedTermFilter. Allow ProtectedTermFilterFactory to be used outside of CustomAnalyzer, including in Solr, by allowing wrapped filters and their parameters to be specified on construction. Add tests for ProtectedTermFilterFactory in lucene/common/analysis/ and in solr/core/. Add Solr ref guide documentation for ProtectedTermFilterFactory. Improve javadocs for CustomAnalyzer, ConditionalTokenFilter, and ProtectedTermFilter. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482084#comment-16482084 ] ASF subversion and git services commented on LUCENE-8273: - Commit 97490ac7c66313625c9526a466ab96981e249746 in lucene-solr's branch refs/heads/branch_7x from [~steve_rowe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=97490ac ] LUCENE-8273: Rename TermExclusionFilter -> ProtectedTermFilter. Allow ProtectedTermFilterFactory to be used outside of CustomAnalyzer, including in Solr, by allowing wrapped filters and their parameters to be specified on construction. Add tests for ProtectedTermFilterFactory in lucene/common/analysis/ and in solr/core/. Add Solr ref guide documentation for ProtectedTermFilterFactory. Improve javadocs for CustomAnalyzer, ConditionalTokenFilter, and ProtectedTermFilter. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482079#comment-16482079 ] Steve Rowe commented on LUCENE-8273: Attached remaining parts from my -rebased patch. Tests and precommit pass. Committing shortly. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480717#comment-16480717 ] Steve Rowe commented on LUCENE-8273: {quote} Would you like me to make a new patch from the other stuff in my patch? Yes please! Feel free to commit it if you think it's ready, I wasn't sure if you still wanted to add more testing or docs. {quote} OK, will do. Yeah, I think it's ready to go, but I'll re-run precommit and tests before I commit. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480712#comment-16480712 ] Alan Woodward commented on LUCENE-8273: --- bq. Would you like me to make a new patch from the other stuff in my patch? Yes please! Feel free to commit it if you think it's ready, I wasn't sure if you still wanted to add more testing or docs. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480698#comment-16480698 ] Steve Rowe commented on LUCENE-8273: bq. Attached is a patch that includes Steve Rowe's test improvements. Would you like me to make a new patch from the other stuff in my patch? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480594#comment-16480594 ] ASF subversion and git services commented on LUCENE-8273: - Commit 443bd2ac526a40079e9028a5e7d8ad2f24ca6a4b in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=443bd2a ] LUCENE-8273: Fix end() and posInc handling > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480595#comment-16480595 ] ASF subversion and git services commented on LUCENE-8273: - Commit b1ee23c525a64017242148bca4368fe1be3a in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b1ee23c ] LUCENE-8273: Fix end() and posInc handling > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480574#comment-16480574 ] Alan Woodward commented on LUCENE-8273: --- Nope, that's a relic from a failed attempt to fix something, and precommit has just pulled me up on it not having any javadocs :) Will nuke it before I commit. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480559#comment-16480559 ] Robert Muir commented on LUCENE-8273: - is the TokenBuffer class in the patch actually used? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480549#comment-16480549 ] Alan Woodward commented on LUCENE-8273: --- I think I have now chased down the last failures, all around end() propagation and how to deal with position increments when FilteredTermFilter is skipped. Attached is a patch that includes [~steve_rowe]'s test improvements. I'll commit this now, and un-awaitsfix TestRandomChains and see if that throws out any new bugs in the next while. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, > LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478208#comment-16478208 ] Robert Muir commented on LUCENE-8273: - {quote} In {{TestConditionalFilter}}, converted most {{CannedTokenStream}}'s to {{MockTokenizer}}'s, which causes error {{IllegalStateException: end() called in wrong state=END!}} - I'm guessing you already know about this and are working on it {quote} Good approach, TestRandomChains is rather inefficient (basically an integration test) and its best to always make the fails reproduce with simpler unit tests. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478177#comment-16478177 ] Steve Rowe commented on LUCENE-8273: bq. I'm attaching my latest patch, which includes the failing seed. Steve Rowe, can you rebase on top of this? Done: [^LUCENE-8273-part2-rebased.patch]. (I haven't run tests or precommit though.) > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, > LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478105#comment-16478105 ] Steve Rowe commented on LUCENE-8273: bq. Every time I think I have this fixed, TestRandomChains finds another failure... I'm attaching my latest patch, which includes the failing seed. Steve Rowe, can you rebase on top of this? Sure, will do. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477910#comment-16477910 ] Alan Woodward commented on LUCENE-8273: --- Every time I think I have this fixed, TestRandomChains finds another failure... I'm attaching my latest patch, which includes the failing seed. [~steve_rowe], can you rebase on top of this? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2.patch, > LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477590#comment-16477590 ] Steve Rowe commented on LUCENE-8273: Attaching an updated version of my part2 patch. Changes: * Added a Solr test and ref guide text for {{ProtectedTermFilterFactory}} * Moved the failing read-ahead test ved over to {{TestConditionalFilter}} * In {{TestConditionalFilter}}, converted most {{CannedTokenStream}}'s to {{MockTokenizer}}'s, which causes error {{IllegalStateException: end() called in wrong state=END!}} - I'm guessing you already know about this and are working on it * Except for the {{end()}}-related failures, tests succeed * Precommit succeeds > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477116#comment-16477116 ] Alan Woodward commented on LUCENE-8273: --- There's a bug in the way that tokens are buffered if the wrapped TokenFilter needs to read ahead. I'm working on a fix for that now. TestRandomChains has found quite a few problems with this, I'm tempted to back it out and work on a branch for a while as it's clearly not ready for release yet. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475166#comment-16475166 ] Steve Rowe commented on LUCENE-8273: I stumbled on what looks like a {{ProtectedTermFilter}} bug when a wrapped filter is a filtering token filter, and the content to be analyzed contains at least one non-protected term prior to a protected term; in this case protection fails: {code:java|title=TestProtectedTerm.java} public void testWrappedFilteringTokenFilter() throws IOException { CharArraySet protectedTerms = new CharArraySet(5, true); protectedTerms.add("foobar"); TokenStream stream = whitespaceMockTokenizer("foobar abc"); stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4)); assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // succeeds stream = whitespaceMockTokenizer("wuthering foobar abc"); stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4)); assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // fails @ term 0: Actual: abc } {code} I haven't yet figured out what the problem is. Alan, do you understand what's happening here? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474120#comment-16474120 ] Steve Rowe commented on LUCENE-8273: Okay, I'll finish up the remaining work today. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473985#comment-16473985 ] Alan Woodward commented on LUCENE-8273: --- Thanks Steve, your patch looks great. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472873#comment-16472873 ] Steve Rowe commented on LUCENE-8273: I've belatedly reviewed the changes committed under this issue - awesome additions! I noticed a few small things that could be improved, and in putting together a patch I found what I think is a bug/omission in the design: {{TermExclusionFilterFactory}} can't be used outside of {{CustomAnalyzer}} because {{ConditionalTokenFilterFactory.inform()}} expects {{setInnerFilters()}} to have already been called, but {{TermExclusionFilterFactory}} doesn't have the ability to do that. As a result, the protected word list can't be loaded. AFAICT this will prevent {{TermExclusionFilterFactory}} from being used in Solr, e.g. My idea to address the problem is to allow {{TermExclusionFilterFactory}} to accept (but not require) as args wrapped filters and those filters' args, and then call {{setInnerFilters()}} from the ctor. I've added a factory test suite that appears to validate the idea. See javadocs in the patch for interface details. Because of the way I produced it, the patch (forthcoming) includes several other changes, some of which may be controversial. I'd be happy to factor out any of these changes if you don't agree with all of them, [~romseygeek]. The other changes, in no particular order: * I think {{TermExclusionFilter}} should be renamed to {{ProtectedTermFilter}} (or similar) - forcing devs to deal with *both* "protected" and "excluded" seems pointless to me; it should be either one or the other everywhere. * {{ProtectedTermFilter}} ctor should {{require()}} the protected terms file param rather than {{get()}} it, since it is in fact required. * Added {{TestConditionalTokenFilter.testMultipleConditionalFilters()}} * The boolean sense of {{ConditionalTokenFilter.shouldFilter()}} is backward in the class javadocs: it says that shouldFilter:false will execute the wrapped filters, but AFAICT it's the opposite. * Some other minor javadoc improvements. If the approach is okay, there is a little bit more work to do: some additional testing (more tests for graceful handling of bad input, and a Solr test), and adding Solr ref guide doc. I haven't run precommit or all tests yet, so there may be some outstanding problems. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472733#comment-16472733 ] Robert Muir commented on LUCENE-8273: - maybe ConditionalTokenFilter's toString() can be improved which will help in debugging chains like that. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472709#comment-16472709 ] Alan Woodward commented on LUCENE-8273: --- Some more failures in testRandomChains: https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/1885/ https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/1857/ I'm going to AwaitsFix it for the weekend, and look at it again on Monday. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472594#comment-16472594 ] Michael McCandless commented on LUCENE-8273: +1, very cool addition to Lucene! > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472342#comment-16472342 ] Mike Sokolov commented on LUCENE-8273: -- Yes, thanks [~romseygeek] this is a great addition > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472235#comment-16472235 ] Robert Muir commented on LUCENE-8273: - I opened LUCENE-8308, I think we have to work our way through that one first. Thanks for the work here, great improvement. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472206#comment-16472206 ] Alan Woodward commented on LUCENE-8273: --- bq. Thanks for debugging the failure There will be more, I'm sure! bq. Should we open a followup issue to clean up the analyzers and stuff? +1 > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472184#comment-16472184 ] Robert Muir commented on LUCENE-8273: - Thanks for debugging the failure: TestRandomChains is cruel but it works. Should we open a followup issue to clean up the analyzers and stuff? This is something i can help with. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471991#comment-16471991 ] ASF subversion and git services commented on LUCENE-8273: - Commit aea073e029f0aa3dd32acb41ba0209c12da9725c in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aea073e ] LUCENE-8273: Fix end() propagation > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471992#comment-16471992 ] ASF subversion and git services commented on LUCENE-8273: - Commit 2225d0e464eeeabcec682edcabd51b136ac1aee2 in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2225d0e ] LUCENE-8273: Fix end() propagation > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471900#comment-16471900 ] Alan Woodward commented on LUCENE-8273: --- The elasticsearch CI had some failures due to this: {code} ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=EF8BCF910EB1138C -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=es-CR -Dtests.timezone=Asia/Ashgabat -Dtests.asserts=true -Dtests.file.encoding=UTF8 {code} They look to be caused by FingerprintFilter being wrapped in a ConditionalTokenStream, which I don't think makes any sense? So the simplest solution is probably to blacklist it. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: 7.4 > > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471732#comment-16471732 ] ASF subversion and git services commented on LUCENE-8273: - Commit 1ce3ebadbd60d4485145c71b48330d0aabde77ad in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ce3eba ] LUCENE-8273: Add ConditionalTokenFilter > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471731#comment-16471731 ] ASF subversion and git services commented on LUCENE-8273: - Commit 7f13e5642924aa7f97e21b059553cfd6b5b56c21 in lucene-solr's branch refs/heads/branch_7x from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7f13e56 ] LUCENE-8273: Add ConditionalTokenFilter > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468547#comment-16468547 ] Alan Woodward commented on LUCENE-8273: --- New patch, {{whenTerm()}} now takes {{Predicate}}, and {{TermExclusionFilterFactory}} uses {{protected}} instead of {{excludeFile}} > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467407#comment-16467407 ] Robert Muir commented on LUCENE-8273: - For the new TermExclusionFilterFactory: {code} public static final String EXCLUDED_TOKENS = "excludeFile"; {code} KeywordMarkerFilterFactory currently uses a different parameter name: "protected". So does WordDelimiterFilterFactory (which I think was the one that inspired this JIRA issue). Maybe we want it to be consistent, to support migrating away from that stuff? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467353#comment-16467353 ] Alan Woodward commented on LUCENE-8273: --- Thanks David, I'll do that. Will commit later on if nobody else has any comments. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466309#comment-16466309 ] David Smiley commented on LUCENE-8273: -- Looks good Alan. * Maybe make {{ConditionBuilder whenTerm(Predicate predicate)}} even simpler by referring to CharSequence in the predicate instead of CharTermAttribute? This is just a convenience method after all; and it keeps the caller immune to some lower level details like Attributes. And it somewhat prevents them from using mutable methods on CharTermAttribute which would be pretty nasty. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466012#comment-16466012 ] Alan Woodward commented on LUCENE-8273: --- Patch, up to date with master and passing all tests and precommit. I think this is ready? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459643#comment-16459643 ] Alan Woodward commented on LUCENE-8273: --- bq. it seems you need to make the ConditionalTokenFilterFactory implement the resourceloaderaware stuff always Just spotted Robert's comment here, I've added a new patch which fixes this. Thanks! > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459564#comment-16459564 ] Alan Woodward commented on LUCENE-8273: --- Updated patch. {{ConditionalTokenFilterFactory}} is now a top-level class, distinct from {{ConditionBuilder}}. I've added a TermExclusionFilter that accepts a list of terms and only runs its child filters if the current token is not in its list, and demonstrated how to use it in TestCustomAnalyzer. At the moment it just reads a word file, but we can expand it to accept patterns or a directly passed in list of terms in follow ups. I've also changed the CustomAnalyzerBuilder to use {{when}} rather than {{ifXXX}} - thanks for the suggestion Steve! > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459103#comment-16459103 ] Steve Rowe commented on LUCENE-8273: bq. {{if}} is a keyword, unfortunately, so that won't work. Maybe {{ifCondition}} instead? Shorter and almost as good?: {{when}} (used e.g. by Mockito) > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458569#comment-16458569 ] Robert Muir commented on LUCENE-8273: - sounds good. yeah i know the resource stuff/keywork marking is tricky, i looked at what the existing factory is doing and its pretty crazy. it seems you need to make the ConditionalTokenFilterFactory implement the resourceloaderaware stuff always, because its separately a bug that the current patch will "hide" the resourceloader from anything inside the if? So I think it should implement the interface and pass the loader in its inform() method to stuff inside. Maybe this leads towards a solution to what you need for the conditional part, too. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458550#comment-16458550 ] Alan Woodward commented on LUCENE-8273: --- {{if}} is a keyword, unfortunately, so that won't work. Maybe {{ifCondition}} instead? Access to resources will be a bit trickier, I think maybe the best way to do that would be to have a specialised method {{ifInList}} or something similar that takes a path to a word list. I'll see what I can come up with. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458507#comment-16458507 ] Robert Muir commented on LUCENE-8273: - I'm also curious about common use cases where the condition is just matching a list of words, basically what the KeywordMarker factory provides today. Does the user have access to stuff like resource loaders to read from files / is it intuitive so they won't be reading the list of words in on every token or other mistakes? If we can make this simple, I think we can deprecate KeywordMarker and many other exception-list-type mechanisms hardcoded in all the filters, which would be a really nice cleanup. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458501#comment-16458501 ] Robert Muir commented on LUCENE-8273: - I like the custom analyzer integration. Can we rename {{ifMatches}} just to {{if}}? This makes it more natural for the user to pair with the necessary {{endif}}, doesn't imply any regex matching, etc. I think its useful to mention the necessary endif in the javadocs for both these methods, if users forget to call it they should get a compile error from build(), but it may not be obvious. I would also add to the {{ifTerm}} docs that it is just sugar, with snippet of how to implement it with if + CharTermAttribute. This gives an example in case the user needs to work on some other attribute. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458472#comment-16458472 ] Alan Woodward commented on LUCENE-8273: --- [~msoko...@gmail.com] in that case, the filter would still get applied, because we only check the token that is passed to the synonym filter from outside. Anything that's pulled by the synonym filter itself doesn't get checked. Although thinking about it, it would be possible to run the check in the OneTimeWrapper as well and return 'false' from things that don't pass the check. I'm not sure how that would work with the graph itself though, it might end up corrupting things. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458469#comment-16458469 ] Alan Woodward commented on LUCENE-8273: --- And here's a patch that integrates it into CustomAnalyzer > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, > LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457066#comment-16457066 ] Mike Sokolov commented on LUCENE-8273: -- Cool! I haven't read the patch carefully, but I looked at the tests, and I see RandomChains test so this is probably covered, but I was wondering about the case you mentioned earlier where say there is a multi-token synonym and shouldFilter() is true for the first token and false for the second. I guess the filter just does not apply the synonym then? On Fri, Apr 27, 2018 at 12:44 PM, Alan Woodward (JIRA)> Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456717#comment-16456717 ] Alan Woodward commented on LUCENE-8273: --- I think I fixed it - attached is a patch including a test where we wrap SynonymGraphFilter, and everything seems to pass. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455543#comment-16455543 ] Robert Muir commented on LUCENE-8273: - just imaging scenarios more, its probably useful if the thing can avoid corrupting graphs. By that i mean: conceptually the user has to understand that the filtering applies the condition based on the first token and that the filter gets whatever it pulls (based on its wanted context), and those are provided "graph-aligned" or something. I think its just inherent in what you are trying to do and not specific to the implementation: it needs to have some restrictions to avoid trouble? So maybe this filter should also consider positionLength... > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453852#comment-16453852 ] Robert Muir commented on LUCENE-8273: - I feel like the OneTimeWrapper you have is close, it just needs to set the boolean on its "parent" after input.incrementToken() ? At that point we know the subfilter is "done" > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453818#comment-16453818 ] Robert Muir commented on LUCENE-8273: - i mean in the worst case, you could make it "correct" by using something like the StackWalker api right? Then we figure out how to make it more efficient. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453814#comment-16453814 ] Robert Muir commented on LUCENE-8273: - ok i get it, so basically the filter switches between two states right now with that boolean it has. So its basically hardcoding that the passed filter will only call incrementToken once with that. maybe it could do this differently, like insert an "end-if" filter around the one passed in by the user, which would signal completion? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453807#comment-16453807 ] Alan Woodward commented on LUCENE-8273: --- Yes, exactly that. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453806#comment-16453806 ] Robert Muir commented on LUCENE-8273: - So you mean that the problem is not related to buffering state, its more that it only expects one call to incrementToken() and does the wrong thing if it gets called additional times? Sorry I'm slow here, i wanted to play with the patch last night but didn't have the time. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452676#comment-16452676 ] Alan Woodward commented on LUCENE-8273: --- bq. You mean any filter that uses captureState? capture/restoreState works fine, the problem comes when you get a filter that needs to look ahead in the tokenstream, so for example if SynonymGraphFilter has a multiword synonym "a b c -> d", and you hit token "a", then the filter pulls in two more tokens to see if it matches the whole synonym; but ConditionalTokenFilter only allows you to pull in one token at a time, because it needs to distinguish between incrementToken() as called by the next filter down the line, and incrementToken() as called by its delegate. > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452431#comment-16452431 ] Mike Sokolov commented on LUCENE-8273: -- Hmm right. Well you might be able to use that idea to detect illegal states at least > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452428#comment-16452428 ] Robert Muir commented on LUCENE-8273: - {quote} This latter one is a bit clunky, as this TokenFilter won't work with filters that consume more than one token at a time - eg ShingleFilter or SynonymGraphFilter. At the moment I have a blacklist, but there may be a better way of isolating that - preferably one that throws errors when you build the TokenStream. Speak up if you have any suggestions. {quote} I don't understand the problem yet, but I will try to fight with it. You mean any filter that uses captureState? > Add a ConditionalTokenFilter > > > Key: LUCENE-8273 > URL: https://issues.apache.org/jira/browse/LUCENE-8273 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8273.patch, LUCENE-8273.patch > > > Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter > in such a way that it could optionally be bypassed based on the current state > of the TokenStream. This could be used to, for example, only apply > WordDelimiterFilter to terms that contain hyphens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org