[jira] [Updated] (OAK-11568) Elastic: improved compatibility for analyzer definitions

Thomas Mueller (Jira) Mon, 31 Mar 2025 05:55:25 -0700


     [ https://issues.apache.org/jira/browse/OAK-11568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Thomas Mueller updated OAK-11568:
---------------------------------
    Description: 
Currently, analyzer definitions for Lucene indexes are not fully compatible 
with Elasticsearch. I guess we don't need 100% compatibility, but we should 
still improve it. I have a few cases that are not currently supported, and I 
hope supporting them is possible.

I found that with recent versions of Elasticsearch, the synonym filter cannot 
be combined with the word delimiter filter or the word delimiter graph filter. 
Doing so can easily result in "IllegalStateException: startOffset must be 
non-negative, and endOffset must be >= startOffset, and offsets must not go 
backwards" when closing the ElasticIndexer object. I tried the following 
combinations, and none of them worked:

* Tokenizer: Standard; filters: LowerCase, WordDelimiter, Synonym, PorterStem
* Tokenizer: Standard; filters: LowerCase, WordDelimiter graph, Synonym, 
PorterStem (replacing with graph)
* Tokenizer: Standard; filters: LowerCase, Synonym, WordDelimiter graph, 
PorterStem (reordering)
* Tokenizer: None; filters: LowerCase, Synonym, WordDelimiter graph, PorterStem 
(no tokenizer)
* Filters: Synonym, WordDelimiter graph (minimum)
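
For illustration, the filter chains listed above could be written as 
Elasticsearch analysis settings roughly as follows. This is a minimal sketch: 
the helper build_settings, the names example_analyzer and example_synonyms, the 
placeholder synonym rule, and the mapping from the Lucene filter names to 
Elasticsearch filter names are all assumptions made for this example, not the 
way Oak actually converts analyzer definitions. The "no tokenizer" variant is 
not covered.

```python
import json

# Elasticsearch built-in token filter names for the Lucene filters in the list.
# The mapping itself is an assumption for illustration, not Oak's actual logic.
ES_FILTER_NAMES = {
    "LowerCase": "lowercase",
    "WordDelimiter": "word_delimiter",
    "WordDelimiter graph": "word_delimiter_graph",
    "PorterStem": "porter_stem",
}


def build_settings(filters, tokenizer="standard"):
    """Build an index-settings body with one custom analyzer (hypothetical names)."""
    filter_defs = {}
    chain = []
    for name in filters:
        if name == "Synonym":
            # The synonym filter needs a definition; the rule below is a placeholder.
            filter_defs["example_synonyms"] = {
                "type": "synonym",
                "synonyms": ["plane, airplane"],
            }
            chain.append("example_synonyms")
        else:
            chain.append(ES_FILTER_NAMES[name])
    return {
        "settings": {
            "analysis": {
                "filter": filter_defs,
                "analyzer": {
                    "example_analyzer": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": chain,
                    }
                },
            }
        }
    }


# The first combination from the list above.
first = build_settings(["LowerCase", "WordDelimiter", "Synonym", "PorterStem"])
print(json.dumps(first, indent=2))
```

Calling build_settings(["Synonym", "WordDelimiter graph"]) gives the minimum 
combination from the last list item, here with the standard tokenizer assumed.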

However, using _just_ the Synonym filter, or _just_ the WordDelimiter graph 
filter, always worked in my tests.
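
The condition named in the exception message can be sketched as a check over 
token offsets. This is only an illustration of the three rules in the message 
(non-negative start, end >= start, start offsets not decreasing); the function 
check_offsets and the sample token streams are hypothetical, and this is not 
the Lucene or Elasticsearch code that raises the exception.

```python
def check_offsets(tokens):
    """Validate (start_offset, end_offset) pairs given in emission order.

    Returns True if all pairs satisfy the rules from the error message,
    otherwise raises ValueError naming the first violation.
    """
    last_start = 0
    for position, (start, end) in enumerate(tokens):
        if start < 0 or end < start:
            raise ValueError(
                f"token {position}: startOffset must be non-negative, "
                f"and endOffset must be >= startOffset ({start}, {end})"
            )
        if start < last_start:
            raise ValueError(
                f"token {position}: offsets must not go backwards "
                f"({start} < {last_start})"
            )
        last_start = start
    return True


# Hypothetical stream: the whole word first, then its parts. Passes the check.
print(check_offsets([(0, 9), (0, 4), (5, 9)]))

# Hypothetical stream: the whole word emitted after its parts. Fails the check.
try:
    check_offsets([(0, 4), (5, 9), (0, 9)])
except ValueError as error:
    print(error)
```

The second stream shows the kind of ordering the message rejects: a token 
whose start offset is smaller than that of a token emitted before it.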

  was:Currently, analyzer definitions for Lucene indexes are not fully 
compatible with Elasticsearch. I guess we don't need 100% compatibility, but we 
should still improve it. I have a few cases that are not currently supported, 
and I hope supporting them is possible.


> Elastic: improved compatibility for analyzer definitions
> --------------------------------------------------------
>
>                 Key: OAK-11568
>                 URL: https://issues.apache.org/jira/browse/OAK-11568
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: elastic-search
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
