[
https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467219#comment-15467219
]
Chetan Mehrotra commented on OAK-4412:
--------------------------------------
Planned feature work is now done and [patch|^OAK-4412-v1.diff] is ready for
review.
h3. Purpose
The hybrid index provides two indexing modes:
h4. nrt
In this mode, Lucene documents are created for each commit as part of the
synchronous commit and are added asynchronously to a *local* index, whose
IndexReader is refreshed with a _refresh interval_ of 1 sec.
h4. sync
In this mode the Lucene document is added to the index and the IndexReader is
refreshed *immediately*. Functionally this is similar to a property index.
This mode should be used where code expects changes made in a session to be
reflected in queries immediately. For example, if a session sets _/a/b/@foo_ to
_bar_ and, right after the session save, performs a query for 'bar' and expects
_/a/b/@foo_ to be part of the result set, then this mode should be used.
Performance-wise this mode is slower than {{nrt}} and slows down writes.
The indexes created under the hybrid index are local and hold only the index
data from the last async index cycle up to the most recent commit. Every search
is performed via a MultiReader which combines a reader from the local index
with one from the index built as part of async indexing.
h3. Usage
To enable one of these modes for an index, make the {{async}} property a
multi-valued property with one of the following values:
* {{async}} = [{{async}}, {{nrt}}] - Enables the NRT mode
* {{async}} = [{{async}}, {{sync}}] - Enables the sync mode
{{LuceneIndexProviderService}} provides some tuning configuration which can
be modified as per setup requirements.
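The mode selection implied by the {{async}} values above can be sketched as
follows. This is only an illustration: {{IndexingModeSketch}} and its method are
hypothetical names that are not part of the patch, and the precedence of
{{sync}} over {{nrt}} when both are present is an assumption.

```java
import java.util.List;

/** Hypothetical helper (not Oak code): derives the hybrid mode from the async values. */
final class IndexingModeSketch {

    enum Mode { NONE, NRT, SYNC }

    /**
     * Maps the values of the multi-valued async property to a mode.
     * Checking "sync" before "nrt" is an assumption made for this sketch.
     */
    static Mode fromAsyncValues(List<String> asyncValues) {
        if (asyncValues.contains("sync")) {
            return Mode.SYNC;
        }
        if (asyncValues.contains("nrt")) {
            return Mode.NRT;
        }
        // Plain async index: no local hybrid index is involved
        return Mode.NONE;
    }
}
```

With this sketch, [{{async}}, {{nrt}}] maps to {{NRT}}, [{{async}}, {{sync}}] maps to
{{SYNC}}, and a plain [{{async}}] maps to {{NONE}}.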
h4. Implementation Detail
Most of the new code lives under the
{{org.apache.jackrabbit.oak.plugins.index.lucene.hybrid}} package. For any
commit involving an index definition marked with {{nrt}} or {{sync}},
{{LuceneIndexEditorProvider}} returns a {{LuceneIndexEditor}} backed by
{{LocalIndexWriterFactory}}. This factory uses {{LocalIndexWriter}} and
stores the prepared {{LuceneDoc}} in {{LuceneDocumentHolder}}. This holder
instance is stored as part of {{CommitContext}} (which is stored in the
{{CommitInfo}} associated with the commit).
Once the merge is done for that commit, the change is picked up by
{{LocalIndexObserver}} (a sync observer). This observer then looks for
{{LuceneDocumentHolder}} and, if found, processes the {{LuceneDoc}} instances
stored in it:
* For documents belonging to {{nrt}} mode it would add the docs to
{{DocumentQueue}}
* For documents belonging to {{sync}} mode it would directly write the document
to {{NRTIndex}} configured for that index
{{DocumentQueue}} asynchronously picks up the docs from the queue and then
writes them to the index.
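The routing done by the observer can be sketched in plain Java as below. This
is a toy model and not the code from the patch: {{LocalDispatchSketch}},
{{Doc}}, {{contentChanged}} and {{drainQueue}} are hypothetical names, the
queue is drained by an explicit call instead of a background task to keep the
example deterministic, and dropping documents when the queue is full is an
assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model (not Oak code) of routing prepared docs by indexing mode. */
final class LocalDispatchSketch {

    enum Mode { NRT, SYNC }

    /** Stand-in for a prepared LuceneDoc: target index, mode and node path. */
    static final class Doc {
        final String indexPath;
        final Mode mode;
        final String docPath;

        Doc(String indexPath, Mode mode, String docPath) {
            this.indexPath = indexPath;
            this.mode = mode;
            this.docPath = docPath;
        }
    }

    private final int queueSize;
    private final Queue<Doc> queue = new ArrayDeque<>();

    /** Records what reached the (simulated) local index, in write order. */
    final List<String> written = new ArrayList<>();

    LocalDispatchSketch(int queueSize) {
        this.queueSize = queueSize;
    }

    /**
     * Called after the merge with the docs found in the holder.
     * Returns the number of nrt docs dropped because the queue was full
     * (the drop-on-full behaviour is an assumption of this sketch).
     */
    int contentChanged(List<Doc> docs) {
        int dropped = 0;
        for (Doc doc : docs) {
            if (doc.mode == Mode.SYNC) {
                write(doc);              // sync: written directly
            } else if (queue.size() < queueSize) {
                queue.add(doc);          // nrt: deferred via the queue
            } else {
                dropped++;
            }
        }
        return dropped;
    }

    /** Simulates the asynchronous consumer; returns the number of docs written. */
    int drainQueue() {
        int count = 0;
        Doc doc;
        while ((doc = queue.poll()) != null) {
            write(doc);
            count++;
        }
        return count;
    }

    private void write(Doc doc) {
        written.add(doc.indexPath + ":" + doc.docPath);
    }
}
```

In this model a {{sync}} document is visible in {{written}} as soon as
{{contentChanged}} returns, while an {{nrt}} document only shows up after the
queue has been drained.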
*NRTIndex*
On the indexing side, each index (represented by {{IndexNode}}) has a matching
{{NRTIndex}} which is constructed by {{NRTIndexFactory}}. Whenever a new
{{IndexNode}} instance is created as a result of a change in the async index
(via {{IndexTracker}}), the factory creates a new {{NRTIndex}} for it. It
keeps a maximum of 2 instances of {{NRTIndex}} and closes and garbage collects
older ones. So an {{NRTIndex}} only has index data for the content indexed
between 2 consecutive async indexing cycles.
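The "keep at most 2 instances" rule can be illustrated with a small sketch for
a single index. The class and method names below ({{NrtIndexRetentionSketch}},
{{createIndex}}, {{openCount}}) are hypothetical and only model the retention
behaviour described above, not the actual {{NRTIndexFactory}}.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (not Oak code) of retaining only the 2 newest NRT indexes. */
final class NrtIndexRetentionSketch {

    /** Stand-in for an NRTIndex; only tracks whether it was closed. */
    static final class NrtIndex {
        final int generation;
        boolean closed;

        NrtIndex(int generation) {
            this.generation = generation;
        }

        void close() {
            closed = true;
        }
    }

    private static final int MAX_OPEN = 2;

    // Oldest index first, newest last
    private final Deque<NrtIndex> open = new ArrayDeque<>();
    private int nextGeneration = 1;

    /** Called whenever the async index changed and a new IndexNode is opened. */
    NrtIndex createIndex() {
        NrtIndex index = new NrtIndex(nextGeneration++);
        open.addLast(index);
        // Close everything older than the two most recent instances
        while (open.size() > MAX_OPEN) {
            open.removeFirst().close();
        }
        return index;
    }

    int openCount() {
        return open.size();
    }
}
```

After three calls to {{createIndex}} only the second and third instance remain
open; the first one has been closed.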
{{NRTIndex}} provides access to the {{IndexWriter}} which is used by
{{DocumentQueue}} to write documents to it. It also creates the {{IndexReader}},
which is obtained from the {{IndexWriter}} making use of [Lucene NRT
Support|http://wiki.apache.org/lucene-java/NearRealtimeSearch].
{{NRTIndex}} also provides access to {{ReaderRefreshPolicy}} which determines
how and when the reader should be refreshed. The policy instance is also made
aware of the changes done to the index. For {{nrt}} indexes
{{TimedRefreshPolicy}} is used, which by default refreshes the reader after a
1 sec delay. For {{sync}} indexes {{RefreshOnWritePolicy}} is used, which
refreshes the reader after any write.
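The difference between the two policies can be sketched as below. This is a
simplified model under stated assumptions: the interface {{Policy}} and the
methods {{updated}} and {{shouldRefresh}} are illustrative names and not the
signatures used in the patch, the clock is injected so the behaviour can be
checked without sleeping, and measuring the delay from the previous refresh is
an assumption.

```java
import java.util.function.LongSupplier;

/** Toy model (not Oak code) of a timed and a refresh-on-write policy. */
final class RefreshPolicySketch {

    interface Policy {
        /** Notifies the policy that documents were written to the index. */
        void updated();

        /** Returns true if the reader should be reopened now. */
        boolean shouldRefresh();
    }

    /** nrt: refresh only if something changed and the delay has elapsed. */
    static final class TimedPolicy implements Policy {
        private final LongSupplier clockMillis;
        private final long refreshDeltaMillis;
        private boolean dirty;
        private long lastRefresh;

        TimedPolicy(LongSupplier clockMillis, long refreshDeltaMillis) {
            this.clockMillis = clockMillis;
            this.refreshDeltaMillis = refreshDeltaMillis;
            this.lastRefresh = clockMillis.getAsLong();
        }

        @Override
        public void updated() {
            dirty = true;
        }

        @Override
        public boolean shouldRefresh() {
            long now = clockMillis.getAsLong();
            if (dirty && now - lastRefresh >= refreshDeltaMillis) {
                dirty = false;
                lastRefresh = now;
                return true;
            }
            return false;
        }
    }

    /** sync: refresh as soon as any write happened. */
    static final class OnWritePolicy implements Policy {
        private boolean dirty;

        @Override
        public void updated() {
            dirty = true;
        }

        @Override
        public boolean shouldRefresh() {
            boolean refresh = dirty;
            dirty = false;
            return refresh;
        }
    }
}
```

With a delay of 1000 ms the timed variant ignores a write until one second has
passed since the last refresh, whereas the on-write variant asks for a refresh
right after every write.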
h4. Benchmark
A benchmark has been implemented in oak-run under {{HybridIndexTest}}. It
creates multiple indexes (_numOfIndexes_ = 10) to simulate a system having
multiple indexes defined and then creates nodes with property {{foo}} set
to a value from the enum _Status_. Each thread then creates nodes in
breadth-first fashion (by default 5 child nodes per node, and then again for
each child node). In addition there is a {{Searcher}} thread which queries for
different values and a {{Mutator}} which modifies the values. The following
options can be set:
* refreshDeltaMillis - 1000 - Time delay between reader reopen for nrt
* asyncInterval - 5 - Time in seconds for async indexer
* queueSize - 1000 - Size of queue used by {{DocumentQueue}}
* hybridIndexEnabled - Boolean flag. If set to true hybrid index would be used
otherwise property index would be used
* indexingMode - nrt - Which mode to use ({{nrt}} or {{sync}}) if
hybridIndexEnabled is set to true
* useOakCodec - Boolean flag if set to true {{oakCodec}} would be used to avoid
compression which slows down the searches (OAK-1737)
{noformat}
java -DhybridIndexEnabled=true -DindexingMode=nrt -jar oak-run*.jar benchmark
--concurrency=5 HybridIndexTest Oak-Mongo-FDS Oak-Segment-Tar-FDS
{noformat}
_Results will be posted soon_
h4. Pending Feature Work
* Support for listening to external changes and then updating the {{nrt}}
indexes based on those changes
* JMX MBean around {{NRTIndexFactory}} to see rate of change etc.
> Lucene hybrid index
> -------------------
>
> Key: OAK-4412
> URL: https://issues.apache.org/jira/browse/OAK-4412
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: lucene
> Reporter: Tomek Rękawek
> Assignee: Chetan Mehrotra
> Fix For: 1.6
>
> Attachments: OAK-4412-v1.diff, OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After
> performing some stress-tests with a geo-distributed Mongo cluster, we've
> found out that updating property indexes is a large part of the overall
> traffic.
> The asynchronous index would be an answer here (as the index update won't be
> made in the client request thread), but the AEM requires the updates to be
> visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a
> synchronous, locally-stored counterpart that will persist only the data since
> the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local
> files. Once the "main" Lucene index is being updated, the local index will be
> purged.
> Queries will use an union of results from the {{lucene}} and
> {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a local stored entity, will be updated using
> an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for the
> OAK-4233.