[
https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238274#comment-17238274
]
Dave Hughes edited comment on OAK-9145 at 11/24/20, 6:54 PM:
-------------------------------------------------------------
Thank you [~thomasm], much appreciated.
In response to the problems you mentioned in your last comment:
1. Yes, I considered the backwards compatibility issue. In my mind, this
fixes a bug, and I generally don't place much value on preserving backwards
compatibility with bugs. But I also understand that this is a widely used
project and that many consumers may depend on the current incorrect
behaviour.
I'm not sure I follow your suggestion to use a different version number. Are
you proposing that this fix should wait for the next major version release of
Oak? Or that this should become a new (non-default) OakAnalyzer2, in order to
provide an analyzer which works correctly, but would have to be manually
selected by consumers? If the latter, I feel like it defeats the purpose of
this bugfix, since the effort to switch to the corrected OakAnalyzer2 would be
practically identical to manually configuring the analyzer's filter chain (the
workaround that I described).
2. As I commented on the GitHub PR, I did try to create a test case for this,
but I struggled a lot. As you mentioned in your earlier comment, "Not sure
where to put it best". I think the problem is largely that the OakAnalyzer has
not been adequately tested, which is why there's no obvious place to put the
new test case. I would greatly appreciate it if any contributors who are more
familiar with the code base could help add tests around the OakAnalyzer
class.
> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
> Key: OAK-9145
> URL: https://issues.apache.org/jira/browse/OAK-9145
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: indexing, jcr, lucene
> Environment: Discovered while performing DAM searches in Adobe
> Experience Manager.
> Reporter: Dave Hughes
> Assignee: Thomas Mueller
> Priority: Minor
> Labels: easyfix, pull-request-available
>
> I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the
> wrong order. WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS
> flag, which splits camelCase/PascalCase into multiple terms, but since the
> LowerCaseFilter is applied first, the mixed-case is lost and the terms can't
> be split.
> When searching for savings, the damAssetLucene index (which uses the default
> OakAnalyzer) does not find an asset named savingsAccount.svg.
> Upon configuring the index's analyzers (/oak:index/damAssetLucene/analyzers)
> to apply WordDelimiterFilter before LowerCaseFilter, the correct behaviour
> was seen.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}
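The ordering problem described in the issue can be illustrated with a minimal
sketch in plain Java. This is not the OakAnalyzer or Lucene code: the class
FilterOrderSketch and its methods are hypothetical stand-ins that only mimic
lowercasing and splitting at lower-to-upper case transitions, and they ignore
everything else the real filters do (for example, splitting on punctuation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; it does not use Lucene or Oak classes.
public class FilterOrderSketch {

    // Simplified stand-in for WordDelimiterFilter with GENERATE_WORD_PARTS:
    // splits a token at each lower-case to upper-case transition.
    public static List<String> splitOnCaseChange(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            if (Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    // Simplified stand-in for LowerCaseFilter.
    public static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Order reported in the issue: lowercase first, then split.
    public static List<String> lowerThenSplit(String token) {
        List<String> out = new ArrayList<>();
        for (String t : lowerCase(List.of(token))) {
            out.addAll(splitOnCaseChange(t));
        }
        return out;
    }

    // Order used by the workaround: split first, then lowercase.
    public static List<String> splitThenLower(String token) {
        return lowerCase(splitOnCaseChange(token));
    }

    public static void main(String[] args) {
        System.out.println(lowerThenSplit("savingsAccount")); // [savingsaccount]
        System.out.println(splitThenLower("savingsAccount")); // [savings, account]
    }
}
```

With lowercasing first, the case boundary is gone before the split runs, so
the single term savingsaccount is produced and a search for savings has
nothing to match. Splitting first yields savings and account.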