[
https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277216#comment-17277216
]
Fabrizio Fortino commented on OAK-9145:
---------------------------------------
I have created a draft [PR|https://github.com/apache/jackrabbit-oak/pull/266]
changing the order of the filters for both elastic and lucene.
For elastic, there are no other changes to do. The provided unit test passes
successfully.
For lucene, I had to add the flag WordDelimiterFilter.SPLIT_ON_CASE_CHANGE.
Without it, the provided unit test does not pass. OakAnalyzer overrides the
default flags, so SPLIT_ON_CASE_CHANGE, which would otherwise be enabled by
default, needs to be passed explicitly.
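To illustrate why both the filter order and the case-change split matter, here is a minimal plain-Java sketch. It does not use Lucene at all: splitOnCaseChange, lowerThenSplit and splitThenLower are hypothetical helpers that only model the behaviour described above, not the actual WordDelimiterFilter or OakAnalyzer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FilterOrderSketch {

    // Simplified stand-in for a case-change split (NOT Lucene's implementation):
    // start a new part wherever a lowercase letter is followed by an uppercase one.
    static List<String> splitOnCaseChange(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            if (Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    // Models the order reported in the issue: lowercase first, then split.
    // The case information is already gone, so nothing can be split.
    static List<String> lowerThenSplit(String token) {
        return splitOnCaseChange(token.toLowerCase(Locale.ROOT));
    }

    // Models the proposed order: split on case change first, then lowercase each part.
    static List<String> splitThenLower(String token) {
        List<String> out = new ArrayList<>();
        for (String part : splitOnCaseChange(token)) {
            out.add(part.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String token = "savingsAccount";
        System.out.println(lowerThenSplit(token)); // [savingsaccount]
        System.out.println(splitThenLower(token)); // [savings, account]
    }
}
```

With the split applied first, a search for savings can match the first part of savingsAccount. With lowercasing applied first, only the single term savingsaccount is produced.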
The problem is that a number of existing tests then fail. It seems that the
camel case split was left out on purpose. As an example:
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-search/src/test/java/org/apache/jackrabbit/oak/plugins/index/IndexQueryCommonTest.java#L352-L365]
would fail because the query would return 2 results but only 1 is expected.
I am not sure why OakAnalyzer has been configured this way. I think that
changing it, although it seems reasonable, could have undesired effects on
some queries.
[~thomasm] [~dave.l.hughes] any feedback?
> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
> Key: OAK-9145
> URL: https://issues.apache.org/jira/browse/OAK-9145
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: indexing, jcr, lucene
> Environment: Discovered while performing DAM searches in Adobe
> Experience Manager.
> Reporter: Dave Hughes
> Assignee: Fabrizio Fortino
> Priority: Minor
> Labels: easyfix, pull-request-available
>
> I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the
> wrong order. WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS
> flag, which splits camelCase/PascalCase into multiple terms, but since the
> LowerCaseFilter is applied first, the mixed-case is lost and the terms can't
> be split.
> Searching for savings, the damAssetLucene index (which uses the default
> OakAnalyzer) does not find an asset named savingsAccount.svg.
> Upon configuring the index's analyzers (/oak:index/damAssetLucene/analyzers)
> to apply WordDelimiterFilter before LowerCaseFilter, the correct behaviour
> was seen.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}