[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-11-20 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693360#comment-16693360
 ] 

ASF subversion and git services commented on LUCENE-8497:
-

Commit 65486442c4a893a17cd70c9a865fa1af7c160aa3 in lucene-solr's branch 
refs/heads/jira/http2 from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6548644 ]

LUCENE-8497: Replace MultiTermAwareComponent with normalize() method


> Rethink multi-term analysis handling
>
> Key: LUCENE-8497
> URL: https://issues.apache.org/jira/browse/LUCENE-8497
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: LUCENE-8497.patch, LUCENE-8497.patch, LUCENE-8497.patch, 
> LUCENE-8497.patch
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> The current framework for handling term normalisation works via instanceof 
> checks for MultiTermAwareComponent and casts.  MultiTermAwareComponent itself 
> deals in AbstractAnalysisComponents, and so callers need to cast to the 
> correct component type before use, which is ripe for misuse.
> We should re-organise all this to be type-safe and usable without casts.  One 
> possibility is to add `normalize` methods to CharFilterFactory and 
> TokenFilterFactory that mirror their existing `create` methods.  The default 
> implementation would return the input unchanged, while filters that should 
> apply at normalization time can delegate to `create`.
> Related to this, we should deprecate and remove LowerCaseTokenizer, which 
> combines tokenization and normalization in a way that will break this API.
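
The proposal in the description above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the Sketch* class names are hypothetical, and tokens are modelled as a List<String> rather than Lucene's TokenStream, so it should not be read as the committed API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for a token filter factory. Tokens are plain strings here,
// not Lucene's TokenStream.
abstract class SketchTokenFilterFactory {
    // Behaviour applied during full analysis.
    abstract List<String> create(List<String> input);

    // Behaviour applied at normalization time; the default returns the input unchanged.
    List<String> normalize(List<String> input) {
        return input;
    }
}

// Lowercasing should also apply at normalization time, so normalize delegates to create.
class SketchLowerCaseFactory extends SketchTokenFilterFactory {
    @Override
    List<String> create(List<String> input) {
        return input.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    @Override
    List<String> normalize(List<String> input) {
        return create(input);
    }
}

// Stop-word removal is not a normalization step, so this factory keeps the default normalize.
class SketchStopFactory extends SketchTokenFilterFactory {
    private final Set<String> stopWords;

    SketchStopFactory(String... stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    @Override
    List<String> create(List<String> input) {
        return input.stream()
                .filter(token -> !stopWords.contains(token))
                .collect(Collectors.toList());
    }
}
```

With this shape a caller can invoke normalize on any factory directly, with no instanceof check or cast, which is the type-safety point the description is making.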



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-11-20 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693361#comment-16693361
 ] 

ASF subversion and git services commented on LUCENE-8497:
-

Commit c2bd3aed22b439168fb2bfadcdcee4fed09e4ff7 in lucene-solr's branch 
refs/heads/jira/http2 from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bd3ae ]

LUCENE-8497: Fix reference to MultiTermAwareComponenent in Solr reference guide








[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-11-19 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691624#comment-16691624
 ] 

ASF subversion and git services commented on LUCENE-8497:
-

Commit c2bd3aed22b439168fb2bfadcdcee4fed09e4ff7 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bd3ae ]

LUCENE-8497: Fix reference to MultiTermAwareComponenent in Solr reference guide








[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-11-19 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691506#comment-16691506
 ] 

ASF subversion and git services commented on LUCENE-8497:
-

Commit 65486442c4a893a17cd70c9a865fa1af7c160aa3 in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6548644 ]

LUCENE-8497: Replace MultiTermAwareComponent with normalize() method








[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-11-06 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676945#comment-16676945
 ] 

Erick Erickson commented on LUCENE-8497:


LGTM +1







[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-11-06 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676578#comment-16676578
 ] 

Alan Woodward commented on LUCENE-8497:
---

I plan on committing this soon - any objections, speak up now...







[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-10-30 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669476#comment-16669476
 ] 

Lucene/Solr QA commented on LUCENE-8497:


| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 7 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 2m 16s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 2m 7s | the patch passed |
| +1 | javac | 2m 7s | the patch passed |
| +1 | Release audit (RAT) | 0m 43s | the patch passed |
| +1 | Check forbidden APIs | 0m 27s | the patch passed |
| +1 | Validate source patterns | 0m 27s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 12s | common in the patch passed. |
| +1 | unit | 0m 28s | icu in the patch passed. |
| +1 | unit | 0m 50s | kuromoji in the patch passed. |
| +1 | unit | 48m 35s | core in the patch passed. |
| | | 61m 10s | |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8497 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946196/LUCENE-8497.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 856e28d |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_172 |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/114/testReport/ |
| modules | C: lucene/analysis/common lucene/analysis/icu lucene/analysis/kuromoji solr/core U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/114/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |


This message was automatically generated.









[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-10-30 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668544#comment-16668544
 ] 

Alan Woodward commented on LUCENE-8497:
---

New patch, this time actually removing the outdated test...







[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-10-29 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667649#comment-16667649
 ] 

Lucene/Solr QA commented on LUCENE-8497:


| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 6 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 2m 6s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 2m 2s | the patch passed |
| +1 | javac | 2m 2s | the patch passed |
| +1 | Release audit (RAT) | 0m 41s | the patch passed |
| +1 | Check forbidden APIs | 0m 25s | the patch passed |
| +1 | Validate source patterns | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 13s | common in the patch passed. |
| +1 | unit | 0m 29s | icu in the patch passed. |
| +1 | unit | 0m 59s | kuromoji in the patch passed. |
| -1 | unit | 46m 35s | core in the patch failed. |
| | | 59m 21s | |

|| Reason || Tests ||
| Failed junit tests | solr.schema.MultiTermTest |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8497 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946072/LUCENE-8497.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / e618e83 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/113/artifact/out/patch-unit-solr_core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/113/testReport/ |
| modules | C: lucene/analysis/common lucene/analysis/icu lucene/analysis/kuromoji solr/core U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/113/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |


This message was automatically generated.







[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-10-29 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667484#comment-16667484
 ] 

Alan Woodward commented on LUCENE-8497:
---

Updated patch.  I've enabled the QA bot because I can't get the Solr tests to 
pass locally, but I think I've chased down all the test failures that were due 
to this change.







[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-10-17 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654550#comment-16654550
 ] 

Erick Erickson commented on LUCENE-8497:


For crying out loud, I wrote that 11 years ago and you expect me to remember 
why ;)

OK, I'll get serious now. The origin of the MultiTermAware bit was to allow us to 
apply some filters for wildcard queries; it all started with LowerCaseFilter. I 
got really tired of explaining to users that "Sol*" didn't find "solr" because 
terms with wildcards were unanalyzed. As long as that behavior is retained, that 
test can be removed for all of me. It's pretty out of date; it only verifies 
a few of the filters that implement that interface anyway.

So remove it if you see fit. A more effective test of the behavior I care about 
would be determining if all the filters that implement that interface properly 
work with, say, wildcards in the search term.
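
The "Sol*" versus "solr" problem described above can be illustrated with a small sketch. It assumes a toy list of already-lowercased index terms and a trailing-wildcard match only; the class and method names are hypothetical and do not use Lucene's query classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration only; not Lucene's wildcard query implementation.
class WildcardNormalizationSketch {
    // Terms as they would sit in an index whose analysis chain lowercases tokens.
    static final List<String> INDEXED_TERMS = Arrays.asList("solr", "lucene", "search");

    // Strips one trailing '*' if present and returns the indexed terms starting with the rest.
    static List<String> prefixMatch(String pattern) {
        String prefix = pattern.endsWith("*")
                ? pattern.substring(0, pattern.length() - 1)
                : pattern;
        return INDEXED_TERMS.stream()
                .filter(term -> term.startsWith(prefix))
                .collect(Collectors.toList());
    }

    // Stand-in for normalization-time analysis: lowercase the pattern, leave '*' alone.
    static String normalize(String pattern) {
        return pattern.toLowerCase(Locale.ROOT);
    }
}
```

Calling prefixMatch("Sol*") on the raw pattern finds nothing, because the indexed terms are lowercase, while prefixMatch(normalize("Sol*")) finds "solr". That is the behaviour the multi-term handling is meant to retain.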







[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

2018-10-16 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651403#comment-16651403
 ] 

Alan Woodward commented on LUCENE-8497:
---

Thanks for the pull request [~mayyas]!  I've extended it to cover Solr as well, 
which allows us to remove MultiTermAwareComponent entirely.

This causes some test failures in Solr's MultiTermTest, but seeing as these 
are explicitly testing implementation (which has now changed), and the behaviour 
itself is tested elsewhere in e.g. TestFoldingMultitermQuery, I think we should be 
OK to just remove this test?  [~erickerickson] you wrote the tests originally, 
does that sound reasonable to you?



