[
https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405340#comment-15405340
]
Chetan Mehrotra commented on OAK-4412:
--------------------------------------
h3. Approach B - Lucene editor used both in async and sync mode
In the current approach, which uses an Observer to update the local transient
index, most of the work is done in a single indexer thread, which would be
doing the following:
# Diff the changed node states
# Build Lucene Documents from the changes
# Add the documents to the index
This might cause the indexer to lag behind the current head again, depending
on the volume of writes. Instead, we can change the approach and break the
work into 2 parts.
*Step 1 - LuceneIndexEditor used in sync mode*
We can move the work done in #1 and #2 above to LuceneIndexEditor, which
would be invoked synchronously [0] as part of a normal commit (similar to
how the current property index editors are invoked). This editor would be
backed by a different {{LuceneIndexWriter}} impl which would add the Documents
to the CommitInfo associated with the current commit [1] instead of adding
them directly to the index.
This would parallelize the expensive tasks of diffing and constructing the
Lucene Documents, keeping them separate from the actual indexing (which is
single threaded by design).
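
The idea of a writer that records documents rather than indexing them could be sketched roughly as below. This is only an illustration under assumptions: {{PendingDoc}}, {{PendingDocHolder}} and {{CollectingIndexWriter}} are hypothetical names, a plain map stands in for a Lucene Document, and the holder stands in for whatever would be attached to CommitInfo. None of this is the actual Oak API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Lucene Document: a node path plus its indexed fields. */
final class PendingDoc {
    final String path;
    final Map<String, String> fields;

    PendingDoc(String path, Map<String, String> fields) {
        this.path = path;
        this.fields = Collections.unmodifiableMap(new HashMap<>(fields));
    }
}

/** Hypothetical holder that would travel with the commit, keyed by index path. */
final class PendingDocHolder {
    private final Map<String, List<PendingDoc>> docsByIndex = new HashMap<>();

    void add(String indexPath, PendingDoc doc) {
        docsByIndex.computeIfAbsent(indexPath, k -> new ArrayList<>()).add(doc);
    }

    /** Documents prepared for the given index during this commit (read-only view). */
    List<PendingDoc> docsFor(String indexPath) {
        List<PendingDoc> docs = docsByIndex.get(indexPath);
        if (docs == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(docs);
    }
}

/** Writer used by the sync editor: it records documents and never touches the index. */
final class CollectingIndexWriter {
    private final String indexPath;
    private final PendingDocHolder holder;

    CollectingIndexWriter(String indexPath, PendingDocHolder holder) {
        this.indexPath = indexPath;
        this.holder = holder;
    }

    void updateDocument(String path, Map<String, String> fields) {
        holder.add(indexPath, new PendingDoc(path, fields));
    }
}
```

With this shape the commit thread only pays for the diff and document construction; the single-threaded index write happens later, when something drains the holder.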
*Step 2 - Async local indexer*
To complement the editor there would be an observer which listens for
changes. This observer
* For local changes, would extract the Documents (prepared and added to the
CommitInfo associated with the change, per the previous step) and add them to
the queue for the writer of the matching index
* For external changes, would run the editor to do the diff, prepare the
documents, and add them to the queue for the respective writer.
Note that the work here can be done on a best-effort basis. If it takes too
long, the indexer can "drop" documents or, say, skip indexing altogether for
external diffs. Those aspects can be exposed for tuning.
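
The "drop on overload" behaviour could look roughly like the following bounded queue. This is a minimal sketch, assuming a hypothetical {{BestEffortDocQueue}} class sitting between the observer and the writer thread; the capacity is the kind of knob that could be exposed for tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded queue between the observer and the single writer thread. */
final class BestEffortDocQueue<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    BestEffortDocQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks the caller; if the queue is full the document is dropped and counted. */
    boolean offer(T doc) {
        boolean accepted = queue.offer(doc);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the writer thread: removes and returns up to max queued documents. */
    List<T> drain(int max) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Dropping is tolerable here only because the regular async index eventually covers the same changes; the drop counter gives a way to observe how often that happens.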
*Step 3 - NRT Reader on Query Side*
On the query side we would construct the reader from the existing writer
itself, utilizing the [Lucene NRT
support|https://wiki.apache.org/lucene-java/NearRealtimeSearch]. With the
updated support for MultiReader (done as part of OAK-4566), the query logic
would also consider this reader for any query evaluation. This would ensure
that queries get to see the most recent results.
We can utilize all aspects of NRT (like skipping deletes, as the query engine
would filter out false results).
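
Why skipping deletes is safe can be illustrated with a small model: candidates from both readers are merged, and anything that no longer matches the current repository state is filtered out afterwards. This is a sketch only; {{HybridQuerySketch}} and its parameters are hypothetical, plain path lists stand in for index readers, and the predicate stands in for the query engine's constraint check.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of combining persisted and NRT results on the query side. */
final class HybridQuerySketch {

    private HybridQuerySketch() {
    }

    /**
     * Merges candidate paths from the async index and the local NRT index,
     * removes duplicates, and keeps only paths that still satisfy the
     * constraint when checked against the current state.
     */
    static List<String> query(List<String> asyncHits,
                              List<String> nrtHits,
                              Predicate<String> matchesCurrentState) {
        Set<String> candidates = new LinkedHashSet<>(asyncHits);
        candidates.addAll(nrtHits);

        List<String> result = new ArrayList<>();
        for (String path : candidates) {
            if (matchesCurrentState.test(path)) {
                result.add(path);
            }
        }
        return result;
    }
}
```

A stale entry left behind in the NRT index costs one extra constraint check at query time but never reaches the caller.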
*Step 4 - Pruning of transient index*
Further, we would need to periodically prune the transient indexes. This can
be done by deleting those documents which are older than the last async index
update cycle. With each async index update we can say that the repository is
indexed up to the time when that async index update started. So we can use
that time and remove from the index those documents which are older than 2
cycles.
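
One possible reading of "older than 2 cycles" is sketched below: remember the start times of the two most recent async cycles and prune everything added before the older of the two. {{TransientIndexPruner}} is a hypothetical name, and a map from document path to timestamp stands in for the transient index; in a real Lucene index this would presumably be a range delete on a timestamp field.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical tracker deciding which transient documents are safe to prune. */
final class TransientIndexPruner {
    // Start times of the two most recent completed async cycles, oldest first.
    private final Deque<Long> cycleStarts = new ArrayDeque<>();

    /**
     * Records a completed async cycle and returns the prune cutoff: the start
     * time of the cycle before the latest one, or -1 if fewer than two cycles
     * have been seen so far.
     */
    long onAsyncCycleCompleted(long cycleStartMillis) {
        cycleStarts.addLast(cycleStartMillis);
        if (cycleStarts.size() > 2) {
            cycleStarts.removeFirst();
        }
        return cycleStarts.size() == 2 ? cycleStarts.peekFirst() : -1L;
    }

    /** Removes every document whose timestamp is strictly below the cutoff. */
    static int prune(Map<String, Long> docTimestamps, long cutoff) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = docTimestamps.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < cutoff) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Keeping one extra cycle of slack means a document is only pruned once an async run that started after it was added has completed.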
*Points to consider*
# Text extraction would be disabled for such transient indexing
# All this would be done on a best-effort basis. Note that even if the index
has some stale data, the QE would still evaluate and enforce the query
constraints [2] and would filter out wrong results.
# Each such transient index would be backed by an FSDirectory. The FSDirectory
would be cleaned upon restart
# The editors need to ignore reindex calls etc.
[0] This would require a change in the current indexing logic, where a given
index definition can be used in either sync or async mode but not in both
[1] CommitInfo is currently not accessible to index editors, so this would
need to be changed
[2] Fulltext constraints would not be evaluated though. Do note that the
primary focus for such a hybrid index is property indexes
> Lucene hybrid index
> -------------------
>
> Key: OAK-4412
> URL: https://issues.apache.org/jira/browse/OAK-4412
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Tomek Rękawek
> Assignee: Tomek Rękawek
> Fix For: 1.6
>
> Attachments: OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After
> performing some stress-tests with a geo-distributed Mongo cluster, we've
> found out that updating property indexes is a large part of the overall
> traffic.
> The asynchronous index would be an answer here (as the index update won't be
> made in the client request thread), but the AEM requires the updates to be
> visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a
> synchronous, locally-stored counterpart that will persist only the data since
> the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local
> files. Once the "main" Lucene index is being updated, the local index will be
> purged.
> Queries will use a union of results from the {{lucene}} and
> {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a local stored entity, will be updated using
> an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for the
> OAK-4233.