[jira] [Commented] (OAK-9145) OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order

Fabrizio Fortino (Jira) Thu, 04 Feb 2021 00:55:06 -0800


    [ 
https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278672#comment-17278672
 ]


Fabrizio Fortino commented on OAK-9145:
---------------------------------------

I am not sure that a version 2 would add value, since we already provide a 
way to change the default analyzer (as the description of this issue also 
shows).

It would add complexity to the configuration to cover one specific need. 
Full-text analyzer configuration is highly subjective and can vary case by 
case.

I am not sure why the current analyzer is configured the way it is. Beyond 
the fact that snake case text gets tokenized while camel case does not, I see 
other inconsistencies.

For example, English possessives get stemmed, but no standard English 
stemming is configured (it would be especially useful for singular/plural 
matching).

With that said, I propose to close this issue without merging the PR. It was 
still useful for discovering differences between the Lucene and Elastic 
analyzers (I will create a separate issue to fix those).

 

> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
>                 Key: OAK-9145
>                 URL: https://issues.apache.org/jira/browse/OAK-9145
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing, jcr, lucene
>         Environment: Discovered while performing DAM searches in Adobe 
> Experience Manager. 
>            Reporter: Dave Hughes
>            Assignee: Fabrizio Fortino
>            Priority: Minor
>              Labels: easyfix, pull-request-available
>
> I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the 
> wrong order.  WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS 
> flag, which splits camelCase/PascalCase into multiple terms, but since the 
> LowerCaseFilter is applied first, the mixed-case is lost and the terms can't 
> be split.
> When searching for savings, the damAssetLucene index (which uses the default 
> OakAnalyzer) does not find an asset named savingsAccount.svg.
> Upon configuring the index's analyzers (/oak:index/damAssetLucene/analyzers) 
> to apply WordDelimiterFilter before LowerCaseFilter, the correct behaviour 
> was seen.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}
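
The ordering effect described above can be illustrated with a minimal sketch. 
It is not Lucene's WordDelimiterFilter or LowerCaseFilter: the helper names 
(split_word_parts, lowercase_then_split, split_then_lowercase) are 
hypothetical, and the splitting rule is an assumed simplification that only 
breaks on non-alphanumeric characters and on lower-to-upper case transitions.

```python
import re


def split_word_parts(token):
    """Simplified stand-in for word-part generation (an assumption, not Lucene code).

    Splits on runs of non-alphanumeric characters, then on each
    lowercase-to-uppercase transition inside the remaining chunks.
    """
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Put a space at every lower->upper boundary, then split on whitespace.
        spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk)
        parts.extend(spaced.split())
    return parts


def lowercase_then_split(token):
    """Lowercase first: case transitions are gone before splitting runs."""
    return split_word_parts(token.lower())


def split_then_lowercase(token):
    """Split first: case transitions are still visible, then normalize case."""
    return [part.lower() for part in split_word_parts(token)]


if __name__ == "__main__":
    for name in ("savingsAccount.svg", "savings_account.svg"):
        print(name, lowercase_then_split(name), split_then_lowercase(name))
```

Under these assumptions, the camel case name only yields a separate "savings" 
term when splitting happens before lowercasing, while the snake case name 
yields it in either order, because the underscore survives lowercasing.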



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
