[jira] [Commented] (OAK-4412) Lucene hybrid index

Chetan Mehrotra (JIRA) Tue, 02 Aug 2016 22:31:17 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405340#comment-15405340
 ]


Chetan Mehrotra commented on OAK-4412:
--------------------------------------

h3. Approach B - Lucene editor used both in async and sync mode

In the current approach, which uses an Observer to update the local transient 
index, most of the work is done in a single indexer thread, which would
# Diff the changed node states
# Build Lucene Documents from the changes
# Add the documents to the index

This might cause the indexer to lag behind the current head again, depending 
on the amount of writes happening. Instead, we can change the approach and 
break the work into 2 parts

*Step 1 - LuceneIndexEditor used in sync mode*

We can move the work done in #1 and #2 above to LuceneIndexEditor, which 
would be invoked synchronously [0] as part of the normal commit (similar to 
how the current property index editors are invoked). This editor would be 
backed by a different {{LuceneIndexWriter}} impl which would add the Documents 
to the CommitInfo associated with the current commit [1] instead of adding 
them directly to the index.

This would thus parallelize the expensive task of diffing and constructing the 
Lucene Documents, separating it from the actual indexing (which is by design 
single threaded).
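
A minimal sketch of the idea in Step 1, using plain Java stand-ins rather than 
the real Oak or Lucene types. {{PreparedDoc}}, {{CommitDocHolder}} and 
{{CollectingIndexWriter}} are hypothetical names introduced only for 
illustration; the holder plays the role the proposal assigns to CommitInfo, 
and the writer records documents on the commit instead of indexing them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a prepared Lucene Document: node path plus fields. */
final class PreparedDoc {
    final String path;
    final Map<String, String> fields;

    PreparedDoc(String path, Map<String, String> fields) {
        this.path = path;
        // Defensive copy so later changes by the caller do not leak in
        this.fields = Collections.unmodifiableMap(new HashMap<>(fields));
    }
}

/** Hypothetical per-commit holder (the role CommitInfo plays in the proposal). */
final class CommitDocHolder {
    // index path -> documents prepared for that index during this commit
    private final Map<String, List<PreparedDoc>> docsByIndex = new HashMap<>();

    void add(String indexPath, PreparedDoc doc) {
        docsByIndex.computeIfAbsent(indexPath, k -> new ArrayList<>()).add(doc);
    }

    List<PreparedDoc> docsFor(String indexPath) {
        return docsByIndex.getOrDefault(indexPath, Collections.emptyList());
    }
}

/** Writer used by the sync editor: collects documents, does no indexing itself. */
final class CollectingIndexWriter {
    private final String indexPath;
    private final CommitDocHolder holder;

    CollectingIndexWriter(String indexPath, CommitDocHolder holder) {
        this.indexPath = indexPath;
        this.holder = holder;
    }

    void updateDocument(String path, Map<String, String> fields) {
        holder.add(indexPath, new PreparedDoc(path, fields));
    }
}
```

Since each commit gets its own holder, the diffing and document construction 
run in the committing thread, and only the collected documents need to be 
handed to the single indexing thread.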

*Step 2 - Async local indexer*

To complement the editor there would be an observer which listens for 
changes. This observer

* For local changes, would extract the Documents (prepared and added to the 
CommitInfo associated with the change, per the previous step) and add them to 
the queue for the writer of the matching index
* For external changes, would run the editor to do the diff, prepare the 
documents and add them to the queue for the respective writer. 

Note that the work here can be done on a best effort basis. So if it takes 
too long the indexer can "drop" documents or, say, avoid indexing external 
diffs altogether. Those aspects can be exposed for tuning.
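
The best effort hand-off could be sketched with a bounded queue, as below. 
This is only an illustration under assumptions: {{BestEffortDocQueue}} is a 
hypothetical name, documents are represented by their paths, and the capacity 
is the kind of setting that could be exposed for tuning. The observer never 
blocks; when the queue is full the document is dropped and counted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded queue between the observer and the indexing thread. */
final class BestEffortDocQueue {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BestEffortDocQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the observer; never blocks. Returns false if the doc was dropped. */
    boolean offer(String docPath) {
        boolean accepted = queue.offer(docPath);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the single indexing thread; takes at most max queued documents. */
    List<String> drainBatch(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Dropping is tolerable here because the regular async index update still 
covers every change eventually.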

*Step 3 - NRT Reader on Query Side*

On the query side we would construct the reader from the existing writer 
itself, utilizing the [Lucene NRT 
support|https://wiki.apache.org/lucene-java/NearRealtimeSearch]. With the 
updated support for MultiReader (done as part of OAK-4566) the query logic 
would also consider this reader for any query evaluation. This would ensure 
that queries get to see the most recent results.

We can utilize all aspects of NRT (like skipping deletes, as the query engine 
would filter out false results).
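
Why skipped deletes are harmless can be illustrated without Lucene. The 
sketch below is a simplified model, not the actual query engine: 
{{HybridQuerySketch}} and its parameters are hypothetical, candidate paths 
stand in for index hits, and a map of current property values stands in for 
the repository state. Hits from the persisted and the transient index are 
united, and each candidate is re-checked against the current state, so stale 
entries drop out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of uniting two index readers and re-checking the constraint. */
final class HybridQuerySketch {

    /**
     * Unites candidate paths from both indexes (duplicates collapse) and keeps
     * only those whose current property value still equals the expected one.
     */
    static List<String> query(List<String> persistedHits,
                              List<String> transientHits,
                              Map<String, String> currentValues,
                              String expected) {
        Set<String> candidates = new LinkedHashSet<>(persistedHits);
        candidates.addAll(transientHits);

        List<String> result = new ArrayList<>();
        for (String path : candidates) {
            // Stale hits (value changed or node gone) fail this check
            if (expected.equals(currentValues.get(path))) {
                result.add(path);
            }
        }
        return result;
    }
}
```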

*Step 4 - Pruning of transient index*

Further, we would need to periodically prune the transient indexes. This can 
be done by deleting those documents which are older than the last async index 
update cycle. With each async index update we can say that the repository is 
indexed up to the time when that async index update was started. So we can use 
that time and remove those documents from the index which are older than 2 
cycles.
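
One possible reading of the pruning rule, sketched with plain Java 
collections. {{TransientIndexPruner}} is a hypothetical name, and the 
interpretation of "older than 2 cycles" as "added before the start of the 
cycle preceding the latest completed one" is an assumption made for this 
example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical pruner tracking when each document entered the transient index. */
final class TransientIndexPruner {
    // document path -> time (ms) at which it was added to the transient index
    private final Map<String, Long> addedAt = new HashMap<>();
    private long previousCycleStart = -1; // start of the cycle before the latest
    private long latestCycleStart = -1;   // start of the latest completed cycle

    synchronized void addDocument(String path, long timeMillis) {
        addedAt.put(path, timeMillis);
    }

    synchronized boolean contains(String path) {
        return addedAt.containsKey(path);
    }

    /**
     * Called when an async index update that started at cycleStartMillis has
     * completed. Returns the number of documents pruned.
     */
    synchronized int onAsyncCycleCompleted(long cycleStartMillis) {
        previousCycleStart = latestCycleStart;
        latestCycleStart = cycleStartMillis;
        if (previousCycleStart < 0) {
            return 0; // fewer than two cycles seen so far, keep everything
        }
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = addedAt.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < previousCycleStart) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Keeping one extra cycle of documents leaves a safety margin for changes that 
were committed while an async update was already running.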

*Points to consider*

# Text extraction would be disabled for such transient indexing
# All this would be done on a best effort basis. Note that even if the index 
has some stale data, the QE would still evaluate and enforce the query 
constraints [2] and would filter out wrong results.
# Each such transient index would be backed by an FSDirectory. The FSDirectory 
would be cleaned upon restart
# The editors need to ignore reindex calls etc.

[0] This would require a change in the current indexing logic, where a given 
index definition can only be used in either sync or async mode, but not in both
[1] CommitInfo is currently not accessible to index editors, so this would 
need to be changed
[2] The fulltext constraint would not be evaluated though. Do note that the 
primary focus for such a hybrid index is the property index

> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Tomek Rękawek
>             Fix For: 1.6
>
>         Attachments: OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After 
> performing some stress-tests with a geo-distributed Mongo cluster, we've 
> found out that updating property indexes is a large part of the overall 
> traffic.
> The asynchronous index would be an answer here (as the index update won't be 
> made in the client request thread), but AEM requires the updates to be 
> visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a 
> synchronous, locally-stored counterpart that will persist only the data since 
> the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local 
> files. Once the "main" Lucene index is being updated, the local index will be 
> purged.
> Queries will use a union of results from the {{lucene}} and 
> {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a local stored entity, will be updated using 
> an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for the 
> OAK-4233.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
