[jira] [Comment Edited] (OAK-9145) OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order

Fabrizio Fortino (Jira) Tue, 02 Feb 2021 07:58:51 -0800


    [ https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277216#comment-17277216 ]


Fabrizio Fortino edited comment on OAK-9145 at 2/2/21, 3:57 PM:
----------------------------------------------------------------

I have created a draft [PR|https://github.com/apache/jackrabbit-oak/pull/266] 
changing the order of the filters for both elastic and lucene.

For elastic, no other changes are needed. The provided unit test passes 
successfully.

For lucene, I also had to add the flag WordDelimiterFilter.SPLIT_ON_CASE_CHANGE. 
Without it, the provided unit test does not pass: reordering the filters alone 
does not change the terms generated by the analyzer. OakAnalyzer overrides the 
default flags, so SPLIT_ON_CASE_CHANGE, which would otherwise be enabled by 
default, needs to be passed explicitly.
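
To illustrate why the order matters, here is a minimal plain-Java sketch. It 
does not use Lucene: splitOnCaseChange is a hypothetical helper that only 
imitates a split at lower-to-upper case transitions, and the class and method 
names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for the filter chain; this is NOT the Lucene or Oak code.
class FilterOrderSketch {

    // Hypothetical helper: splits a token wherever a lowercase letter is
    // directly followed by an uppercase letter (camelCase / PascalCase).
    static List<String> splitOnCaseChange(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            if (Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    // Lowercase first, then split: the case information is already gone,
    // so there is nothing left to split on.
    static List<String> lowerThenSplit(String token) {
        return splitOnCaseChange(token.toLowerCase(Locale.ROOT));
    }

    // Split first, then lowercase each part.
    static List<String> splitThenLower(String token) {
        List<String> out = new ArrayList<>();
        for (String part : splitOnCaseChange(token)) {
            out.add(part.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lowerThenSplit("savingsAccount")); // [savingsaccount]
        System.out.println(splitThenLower("savingsAccount")); // [savings, account]
    }
}
```

In this simplified model, the second ordering only helps if the split on case 
change is actually enabled, which mirrors the need for the explicit flag on the 
lucene side.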

The problem with this is that a number of existing tests then fail. It seems 
that leaving out the camel case split was intentional. As an example:

[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-search/src/test/java/org/apache/jackrabbit/oak/plugins/index/IndexQueryCommonTest.java#L352-L365]

would fail because the query would return 2 results but only 1 is expected.
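
The effect can be sketched with a hypothetical two-document example. This is 
not the content of the linked test; the document names, the query term and the 
analyze helper below are assumptions chosen only to show how enabling the camel 
case split can turn 1 match into 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; it does not reproduce IndexQueryCommonTest.
class CaseSplitRecallSketch {

    // Assumed toy analyzer: optionally splits at lower-to-upper case
    // transitions, then lowercases every resulting term.
    static List<String> analyze(String text, boolean splitOnCaseChange) {
        List<String> terms = new ArrayList<>();
        int start = 0;
        if (splitOnCaseChange) {
            for (int i = 1; i < text.length(); i++) {
                if (Character.isLowerCase(text.charAt(i - 1))
                        && Character.isUpperCase(text.charAt(i))) {
                    terms.add(text.substring(start, i).toLowerCase(Locale.ROOT));
                    start = i;
                }
            }
        }
        terms.add(text.substring(start).toLowerCase(Locale.ROOT));
        return terms;
    }

    // Counts the documents whose analyzed terms contain the query term.
    static int countMatches(List<String> docs, String queryTerm, boolean split) {
        int hits = 0;
        for (String doc : docs) {
            if (analyze(doc, split).contains(queryTerm)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("savingsAccount", "savings");
        System.out.println(countMatches(docs, "savings", false)); // 1
        System.out.println(countMatches(docs, "savings", true));  // 2
    }
}
```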

I am not sure why OakAnalyzer has been configured in this way. I think that 
changing it, although it seems reasonable, could have undesired effects on 
some queries.

[~thomasm] [~dave.l.hughes] any feedback?


> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
>                 Key: OAK-9145
>                 URL: https://issues.apache.org/jira/browse/OAK-9145
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing, jcr, lucene
>         Environment: Discovered while performing DAM searches in Adobe 
> Experience Manager. 
>            Reporter: Dave Hughes
>            Assignee: Fabrizio Fortino
>            Priority: Minor
>              Labels: easyfix, pull-request-available
>
> I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the 
> wrong order.  WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS 
> flag, which splits camelCase/PascalCase into multiple terms, but since the 
> LowerCaseFilter is applied first, the mixed-case is lost and the terms can't 
> be split.
> When searching for savings, the damAssetLucene index (which uses the default 
> OakAnalyzer) does not find an asset named savingsAccount.svg.
> Upon configuring the index's analyzers (/oak:index/damAssetLucene/analyzers) 
> to apply WordDelimiterFilter before LowerCaseFilter, the correct behaviour 
> was seen.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
