[jira] [Updated] (OAK-9145) OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order

Thomas Mueller (Jira) Sun, 22 Nov 2020 23:34:33 -0800


     [ 
https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Mueller updated OAK-9145:
--------------------------------
    Description: 
I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the 
wrong order.  WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS flag, 
which splits camelCase/PascalCase into multiple terms, but since the 
LowerCaseFilter is applied first, the mixed-case is lost and the terms can't be 
split.

Searching for savings in the damAssetLucene index (which uses the default 
OakAnalyzer) does not find an asset named savingsAccount.svg.

After configuring the index's analyzers (/oak:index/damAssetLucene/analyzers) to 
apply WordDelimiterFilter before LowerCaseFilter, the search behaved as 
expected.

{noformat}
{
  "jcr:primaryType": "nt:unstructured",
  "default": {
    "jcr:primaryType": "nt:unstructured",
    "tokenizer": {
      "jcr:primaryType": "nt:unstructured",
      "name": "Standard"
    },
    "filters": {
      "jcr:primaryType": "nt:unstructured",
      "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
      "LowerCase": {"jcr:primaryType": "nt:unstructured"}
    }
  }
}
{noformat}
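
The ordering effect described above can be illustrated with a minimal Python 
sketch. This is not Lucene or Oak code: lower_case and split_on_case_change are 
hypothetical, heavily simplified stand-ins for LowerCaseFilter and for the 
case-change splitting attributed to WordDelimiterFilter, and the single input 
token savingsAccount is assumed for illustration only.

```python
import re


def lower_case(tokens):
    """Simplified stand-in for a lower-casing token filter."""
    return [t.lower() for t in tokens]


def split_on_case_change(tokens):
    """Simplified stand-in for word-part generation.

    Splits each token at lower-to-upper transitions,
    e.g. 'savingsAccount' -> 'savings', 'Account'.
    """
    parts = []
    for t in tokens:
        # Insert a space at every lower->upper boundary, then split on it.
        parts.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", t).split())
    return parts


tokens = ["savingsAccount"]

# Order reported in this issue: lower-casing first removes the case boundary.
lower_then_split = split_on_case_change(lower_case(tokens))

# Order used in the workaround config: split first, then lower-case.
split_then_lower = lower_case(split_on_case_change(tokens))

print(lower_then_split)  # ['savingsaccount']
print(split_then_lower)  # ['savings', 'account']
```

In this toy model, a query for savings can only match a term produced by the 
second ordering, which is consistent with the behaviour reported above.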


  was:I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in 
the wrong order.  WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS 
flag, which splits camelCase/PascalCase into multiple terms, but since the 
LowerCaseFilter is applied first, the mixed-case is lost and the terms can't be 
split.


> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
>                 Key: OAK-9145
>                 URL: https://issues.apache.org/jira/browse/OAK-9145
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing, jcr, lucene
>         Environment: Discovered while performing DAM searches in Adobe 
> Experience Manager. 
> Searching for _savings_, the damAssetLucene index (which uses the default 
> OakAnalyzer) does not find an asset named _savingsAccount.svg_.
> Upon configuring the index's analyzers 
> (_/oak:index/damAssetLucene/analyzers_) to apply WordDelimiterFilter before 
> LowerCaseFilter, the correct behaviour was seen.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}
>            Reporter: Dave Hughes
>            Assignee: Thomas Mueller
>            Priority: Minor
>              Labels: easyfix, pull-request-available



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
