[jira] [Comment Edited] (OAK-9145) OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order

Dave Hughes (Jira) Tue, 24 Nov 2020 10:55:36 -0800


    [ 
https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238274#comment-17238274
 ]


Dave Hughes edited comment on OAK-9145 at 11/24/20, 6:54 PM:
-------------------------------------------------------------

Thank you [~thomasm], much appreciated.

In response to the problems you mentioned in your last comment:
 1. Yes, I considered the backwards-compatibility issue.  In my mind, this 
fixes a bug, and I don't generally put a lot of value on preserving backwards 
compatibility of bugs.  But I also understand that this is a widely used 
project and that many consumers may depend on the current incorrect 
behaviour.

I'm not sure I follow your suggestion to use a different version number.  Are 
you proposing that this fix should wait for the next major version release of 
Oak?  Or that this should become a new (non-default) OakAnalyzer2, in order to 
provide an analyzer which works correctly, but would have to be manually 
selected by consumers?  If the latter, I feel like it defeats the purpose of 
this bugfix, since the effort to switch to the corrected OakAnalyzer2 would be 
practically identical to manually configuring the analyzer's filter chain (the 
workaround that I described).


 2. As I commented on the GitHub PR, I did try to create a test case for this, 
but I struggled a lot.  As you mentioned in your earlier comment, "Not sure 
where to put it best".  I think the problem is largely that the OakAnalyzer has 
not been adequately tested, which is why there's no obvious place to put the 
new test case.  I would greatly appreciate it if any contributors who are more 
familiar with the code base could help add tests around the OakAnalyzer 
class.



> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
>                 Key: OAK-9145
>                 URL: https://issues.apache.org/jira/browse/OAK-9145
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing, jcr, lucene
>         Environment: Discovered while performing DAM searches in Adobe 
> Experience Manager. 
>            Reporter: Dave Hughes
>            Assignee: Thomas Mueller
>            Priority: Minor
>              Labels: easyfix, pull-request-available
>
> I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the 
> wrong order.  WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS 
> flag, which splits camelCase/PascalCase into multiple terms, but since the 
> LowerCaseFilter is applied first, the mixed-case is lost and the terms can't 
> be split.
> When searching for savings, the damAssetLucene index (which uses the default 
> OakAnalyzer) does not find an asset named savingsAccount.svg.
> Upon configuring the index's analyzers (/oak:index/damAssetLucene/analyzers) 
> to apply WordDelimiterFilter before LowerCaseFilter, the correct behaviour 
> was seen.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}
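
To make the ordering problem in the description concrete, here is a minimal
sketch in plain Java.  It does not use Lucene: splitWordParts is a
hypothetical, simplified stand-in for what WordDelimiterFilter does with
GENERATE_WORD_PARTS (it splits only on non-alphanumeric characters and on
lower-to-upper case transitions), and the class and method names are made up
for illustration.  It only shows why lowercasing first leaves nothing to split
on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative model only; NOT Lucene's WordDelimiterFilter or Oak's OakAnalyzer.
class FilterOrderDemo {

    // Simplified stand-in for word-part generation: split on non-alphanumeric
    // characters and on lower-to-upper case transitions (camelCase/PascalCase).
    static List<String> splitWordParts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(current, parts);
                continue;
            }
            boolean caseBoundary = current.length() > 0
                    && Character.isLowerCase(current.charAt(current.length() - 1))
                    && Character.isUpperCase(c);
            if (caseBoundary) {
                flush(current, parts);
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    // Moves the pending characters (if any) into the result list.
    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    // Order described as wrong in the issue: lowercase first, then split.
    static List<String> lowerThenSplit(String token) {
        return splitWordParts(token.toLowerCase(Locale.ROOT));
    }

    // Order used in the workaround: split first, then lowercase each part.
    static List<String> splitThenLower(String token) {
        List<String> out = new ArrayList<>();
        for (String part : splitWordParts(token)) {
            out.add(part.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String name = "savingsAccount.svg";
        System.out.println(lowerThenSplit(name)); // [savingsaccount, svg]
        System.out.println(splitThenLower(name)); // [savings, account, svg]
    }
}
```

In this model, only the second ordering produces a "savings" term, which is
consistent with the search behaviour reported above.  A real test would need to
run the actual analyzer chain rather than this approximation.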



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
