[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693360#comment-16693360 ] ASF subversion and git services commented on LUCENE-8497: - Commit 65486442c4a893a17cd70c9a865fa1af7c160aa3 in lucene-solr's branch refs/heads/jira/http2 from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6548644 ] LUCENE-8497: Replace MultiTermAwareComponent with normalize() method > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: master (8.0) > > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
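The proposal in the issue description can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: `SketchFilterFactory`, `LowerCaseSketchFactory`, `StopWordSketchFactory` and `normalizeTerm` are hypothetical stand-ins, and token streams are modelled as plain lists of strings rather than the real Lucene types. It only shows the shape of the idea: a default `normalize` that returns its input unchanged, and an override that delegates to `create` for filters that are safe to apply at normalization time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the proposal; these are NOT the real Lucene classes.
final class NormalizeSketch {

    /** Stand-in for a token filter factory. */
    static abstract class SketchFilterFactory {
        /** Analysis-time behaviour: transforms the token list. */
        abstract List<String> create(List<String> tokens);

        /** Normalization-time behaviour: by default the input is returned unchanged. */
        List<String> normalize(List<String> tokens) {
            return tokens;
        }
    }

    /** Lowercasing should also apply at normalization time, so normalize delegates to create. */
    static final class LowerCaseSketchFactory extends SketchFilterFactory {
        @Override
        List<String> create(List<String> tokens) {
            return tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList());
        }

        @Override
        List<String> normalize(List<String> tokens) {
            return create(tokens);
        }
    }

    /** Stop-word removal drops tokens, so it keeps the default (no-op) normalize. */
    static final class StopWordSketchFactory extends SketchFilterFactory {
        private final List<String> stopWords = Arrays.asList("the", "a");

        @Override
        List<String> create(List<String> tokens) {
            return tokens.stream()
                    .filter(t -> !stopWords.contains(t))
                    .collect(Collectors.toList());
        }
    }

    /** Runs only the normalize step of each factory; no instanceof checks or casts. */
    static String normalizeTerm(String term, List<SketchFilterFactory> chain) {
        List<String> tokens = Collections.singletonList(term);
        for (SketchFilterFactory factory : chain) {
            tokens = factory.normalize(tokens);
        }
        return tokens.get(0);
    }

    public static void main(String[] args) {
        List<SketchFilterFactory> chain =
                Arrays.asList(new LowerCaseSketchFactory(), new StopWordSketchFactory());
        // The term is lowercased, but the stop-word filter is skipped at normalization time.
        System.out.println(normalizeTerm("The", chain));
    }
}
```

In this sketch the caller never needs to know which factories take part in normalization; each factory decides for itself by overriding `normalize` or leaving the default in place.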
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693361#comment-16693361 ] ASF subversion and git services commented on LUCENE-8497: - Commit c2bd3aed22b439168fb2bfadcdcee4fed09e4ff7 in lucene-solr's branch refs/heads/jira/http2 from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bd3ae ] LUCENE-8497: Fix reference to MultiTermAwareComponenent in Solr reference guide > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: master (8.0) > > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691624#comment-16691624 ] ASF subversion and git services commented on LUCENE-8497: - Commit c2bd3aed22b439168fb2bfadcdcee4fed09e4ff7 in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bd3ae ] LUCENE-8497: Fix reference to MultiTermAwareComponenent in Solr reference guide > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Fix For: master (8.0) > > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691506#comment-16691506 ] ASF subversion and git services commented on LUCENE-8497: - Commit 65486442c4a893a17cd70c9a865fa1af7c160aa3 in lucene-solr's branch refs/heads/master from [~romseygeek] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6548644 ] LUCENE-8497: Replace MultiTermAwareComponent with normalize() method > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676945#comment-16676945 ] Erick Erickson commented on LUCENE-8497: LGTM +1 > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676578#comment-16676578 ] Alan Woodward commented on LUCENE-8497: --- I plan on committing this soon - any objections, speak up now... > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669476#comment-16669476 ] Lucene/Solr QA commented on LUCENE-8497:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 7 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 2m 16s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 2m 7s | the patch passed |
| +1 | javac | 2m 7s | the patch passed |
| +1 | Release audit (RAT) | 0m 43s | the patch passed |
| +1 | Check forbidden APIs | 0m 27s | the patch passed |
| +1 | Validate source patterns | 0m 27s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 12s | common in the patch passed. |
| +1 | unit | 0m 28s | icu in the patch passed. |
| +1 | unit | 0m 50s | kuromoji in the patch passed. |
| +1 | unit | 48m 35s | core in the patch passed. |
| | | 61m 10s | |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8497 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946196/LUCENE-8497.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 856e28d |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_172 |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/114/testReport/ |
| modules | C: lucene/analysis/common lucene/analysis/icu lucene/analysis/kuromoji solr/core U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/114/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

> Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668544#comment-16668544 ] Alan Woodward commented on LUCENE-8497: --- New patch, this time actually removing the outdated test... > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, > LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667649#comment-16667649 ] Lucene/Solr QA commented on LUCENE-8497:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 6 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 2m 6s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 2m 2s | the patch passed |
| +1 | javac | 2m 2s | the patch passed |
| +1 | Release audit (RAT) | 0m 41s | the patch passed |
| +1 | Check forbidden APIs | 0m 25s | the patch passed |
| +1 | Validate source patterns | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 13s | common in the patch passed. |
| +1 | unit | 0m 29s | icu in the patch passed. |
| +1 | unit | 0m 59s | kuromoji in the patch passed. |
| -1 | unit | 46m 35s | core in the patch failed. |
| | | 59m 21s | |

|| Reason || Tests ||
| Failed junit tests | solr.schema.MultiTermTest |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8497 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946072/LUCENE-8497.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / e618e83 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/113/artifact/out/patch-unit-solr_core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/113/testReport/ |
| modules | C: lucene/analysis/common lucene/analysis/icu lucene/analysis/kuromoji solr/core U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/113/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

> Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667484#comment-16667484 ] Alan Woodward commented on LUCENE-8497: --- Updated patch. I've enabled the QA bot because I can't get the Solr tests to pass locally, but I think I've chased down all the test failures that were due to this change. > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654550#comment-16654550 ] Erick Erickson commented on LUCENE-8497: For crying out loud, I wrote that 11 years ago and you expect me to remember why ;) OK, I'll get serious now. The origin of MultiTermAware was to allow us to apply some filters for wildcard queries; it all started with LowerCaseFilter. I got really tired of explaining to users that "Sol*" didn't find "solr" because terms with wildcards were unanalyzed. As long as that behavior is retained, that test can be removed for all of me. It's pretty out of date; it only verifies a few of the filters that implement that interface anyway. So remove it if you see fit. A more effective test of the behavior I care about would be determining whether all the filters that implement that interface work properly with, say, wildcards in the search term. > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch, LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. 
> Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
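The wildcard behaviour described in the comment above ("Sol*" not finding "solr") can be reproduced with a toy example. This is only a sketch under stated assumptions: the index is a plain list of lowercased strings, the only supported pattern is a trailing `*`, and `WildcardNormalizationSketch`, `analyzeForIndex` and `prefixSearch` are hypothetical names, not Lucene or Solr APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical toy model; it does not use or mirror the real Lucene/Solr query code.
final class WildcardNormalizationSketch {

    /** Index-time analysis in this sketch is just lowercasing. */
    static String analyzeForIndex(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    /**
     * Matches a trailing-wildcard pattern such as "Sol*" against indexed terms.
     * When normalizePrefix is false the prefix is used as typed (unanalyzed).
     */
    static List<String> prefixSearch(List<String> indexedTerms, String pattern, boolean normalizePrefix) {
        if (!pattern.endsWith("*")) {
            throw new IllegalArgumentException("this sketch only handles a trailing '*'");
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        String effective = normalizePrefix ? prefix.toLowerCase(Locale.ROOT) : prefix;
        return indexedTerms.stream()
                .filter(term -> term.startsWith(effective))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("Solr", "Solar", "Lucene").stream()
                .map(WildcardNormalizationSketch::analyzeForIndex)
                .collect(Collectors.toList());

        // Unanalyzed prefix "Sol" matches nothing, because the indexed terms are lowercase.
        System.out.println(prefixSearch(index, "Sol*", false));
        // Normalized prefix "sol" matches the lowercased terms.
        System.out.println(prefixSearch(index, "Sol*", true));
    }
}
```

A test along the lines suggested in the comment would run the second kind of query against each filter that takes part in normalization, rather than checking which interface the filter implements.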
[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling
[ https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651403#comment-16651403 ] Alan Woodward commented on LUCENE-8497: --- Thanks for the pull request [~mayyas]! I've extended it to cover Solr as well, which allows us to remove MultiTermAwareComponent entirely. This causes some test failures in Solr's MultiTermTest, but seeing as these are explicitly testing implementation (which has now changed), and the behaviour itself is tested elsewhere in e.g. TestFoldingMultitermQuery, I think we should be OK to just remove this test? [~erickerickson] you wrote the tests originally, does that sound reasonable to you? > Rethink multi-term analysis handling > > > Key: LUCENE-8497 > URL: https://issues.apache.org/jira/browse/LUCENE-8497 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Alan Woodward >Priority: Major > Attachments: LUCENE-8497.patch > > Time Spent: 1h > Remaining Estimate: 0h > > The current framework for handling term normalisation works via instanceof > checks for MultiTermAwareComponent and casts. MultiTermAwareComponent itself > deals in AbstractAnalysisComponents, and so callers need to cast to the > correct component type before use, which is ripe for misuse. > We should re-organise all this to be type-safe and usable without casts. One > possibility is to add `normalize` methods to CharFilterFactory and > TokenFilterFactory that mirror their existing `create` methods. The default > implementation would return the input unchanged, while filters that should > apply at normalization time can delegate to `create`. > Related to this, we should deprecate and remove LowerCaseTokenizer, which > combines tokenization and normalization in a way that will break this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org