[ 
https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350785#comment-15350785
 ] 

Ian Boston commented on OAK-4412:
---------------------------------

Thank you for the detailed explanation. It confirms what I understood from 
the description when I posted the information.

The approach here is very similar to what others have tried in the past. The 
problems they documented were:

1. No increase in index update throughput and no reduction in data latency 
for a cluster-consistent index. Updates are still single-threaded and 
sequential across the cluster.
2. Data latency is reduced only for changes isolated to one node. Sticky 
HTTP sessions mitigate this, but they are/were considered an indication of a 
non-scalable platform.
3. Application issues for searches performed by request threads expecting 
uniform data latency across the cluster.
4. Application issues for searches performed by non-request threads seeing 
local index changes inconsistently across the cluster.
5. Added complexity resulting in lower reliability.

For those who tried, the failure to deliver any improvement in data latency 
for a cluster-consistent index was the biggest reason they abandoned the 
approach. Some reported 3, 4 and 5 to be blockers requiring applications to do 
special things (e.g. querying the DB layer for results unexpectedly missing).

Unfortunately, the detailed reports referenced above are from 5 years ago, so 
I am unlikely to be able to find them. Those reports mirror the experiences 
shared by Shay Banon in some of his early tech talks, e.g. Berlin Buzzwords 
2011 https://vimeo.com/26710663 where he described in detail many different 
methods of implementing distributed Lucene. Current Oak is described at 18:34. 
I think Solr Cloud is described at 22:49, but I could be wrong about its 
commit mechanism (IIRC from Lucene 4 it ships segments but might not require 
hard commits). This enhancement is a form of the distributed directory 
described at 06:37. I share the link because Shay Banon has successfully built 
a technology business over perhaps 15 years based on distributed real-time 
Lucene search, not because I am suggesting you should directly use 
ElasticSearch.

HTH





> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Tomek Rękawek
>             Fix For: 1.6
>
>         Attachments: OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After 
> performing some stress-tests with a geo-distributed Mongo cluster, we've 
> found out that updating property indexes is a large part of the overall 
> traffic.
> The asynchronous index would be an answer here (as the index update won't be 
> made in the client request thread), but AEM requires the updates to be 
> visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a 
> synchronous, locally-stored counterpart that will persist only the data since 
> the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local 
> files. Once the "main" Lucene index is being updated, the local index will be 
> purged.
> Queries will use a union of results from the {{lucene}} and 
> {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a locally stored entity, will be updated 
> using an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for 
> OAK-4233.
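
The mechanism in the description above can be sketched roughly as follows. 
This is only an illustration under stated assumptions, not the code in 
OAK-4412.patch: the class and method names (HybridIndexSketch, onChange, 
asyncIndexCaughtUp, query) are hypothetical, plain Java collections stand in 
for the Lucene indexes, and a numeric revision stands in for the point the 
background indexing job has reached.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a hybrid index; none of these names are Oak APIs. */
final class HybridIndexSketch {

    /** One indexed change: the path, the matching term, and its revision. */
    private static final class Entry {
        final String path;
        final String term;
        final long revision;

        Entry(String path, String term, long revision) {
            this.path = path;
            this.term = term;
            this.revision = revision;
        }
    }

    // Stand-in for the shared, asynchronously updated "lucene" index.
    private final Map<String, Set<String>> asyncIndex = new HashMap<>();
    // Stand-in for the node-local "lucene-memory" index.
    private final List<Entry> localIndex = new ArrayList<>();
    // Highest revision already covered by the async index.
    private long asyncIndexedUpTo = 0;

    /** Called by an observer for each local or remote change this node sees. */
    void onChange(String path, String term, long revision) {
        if (revision > asyncIndexedUpTo) {
            localIndex.add(new Entry(path, term, revision));
        }
    }

    /**
     * Simulates the background indexing job reaching the given revision.
     * Simplification: the async index is fed from the local entries here,
     * whereas a real async indexer would read the repository itself.
     */
    void asyncIndexCaughtUp(long revision) {
        for (Entry e : localIndex) {
            if (e.revision <= revision) {
                asyncIndex.computeIfAbsent(e.term, k -> new LinkedHashSet<>())
                          .add(e.path);
            }
        }
        // Purge local entries that the main index now covers.
        localIndex.removeIf(e -> e.revision <= revision);
        asyncIndexedUpTo = Math.max(asyncIndexedUpTo, revision);
    }

    /** Returns the union of matches from the async and local indexes. */
    Set<String> query(String term) {
        Set<String> result = new LinkedHashSet<>(
                asyncIndex.getOrDefault(term, Collections.<String>emptySet()));
        for (Entry e : localIndex) {
            if (e.term.equals(term)) {
                result.add(e.path);
            }
        }
        return result;
    }

    /** Number of entries still held only in the local index. */
    int localSize() {
        return localIndex.size();
    }
}
```

The sketch models a single node only. It does not model the delay between a 
change being committed on one node and the observer delivering it on another, 
which is where the cross-cluster inconsistencies in points 3 and 4 of the 
comment would come from.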



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
