[jira] [Comment Edited] (OAK-4412) Lucene hybrid index

Chetan Mehrotra (JIRA) Tue, 06 Sep 2016 05:21:47 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467219#comment-15467219
 ]


Chetan Mehrotra edited comment on OAK-4412 at 9/6/16 12:21 PM:
---------------------------------------------------------------

Planned feature work is now done and [patch|^OAK-4412-v1.diff] is ready for 
review.

h3. A - Purpose

The hybrid index provides 2 indexing modes:

h4. nrt
In this mode, Lucene documents are created for each commit as part of the sync 
commit and are added asynchronously to a *local* index, whose IndexReader is 
refreshed with a _refresh interval_ of 1 sec.

The primary aim of this mode is to reduce the delay between a write to content 
and the time it is reflected in query results. With current async indexing the 
latency can range from 5 sec to 1 minute, depending on cluster load and how 
fast async indexing runs. With nrt mode, even if the async indexer does not 
catch up quickly, the local index picks up the changes, so recent additions 
are reflected in query results.

h4. sync
In this mode the Lucene document is added to the index and the IndexReader is 
*immediately* refreshed. Functionally this is similar to a property index. 
This mode has lower performance compared to {{nrt}}. 

This mode should be used where code expects changes made in a session to be 
immediately reflected in queries. So if a session sets _/a/b/@foo_ to _bar_ 
and, right after the session save, performs a query for 'bar' and expects 
/a/b/@foo to be part of the result set, then this mode should be used. 

Performance-wise this mode is slower and slows down writes compared to 'nrt'.

The indexes created under the hybrid index are local and hold index data from 
the last async index cycle up to the most recent commit. Any search is 
performed via a MultiReader that combines a reader from the local index with 
one from the index built as part of async indexing.


h3. B - Usage

To enable this for any index, make the {{async}} property a multi-valued 
property with one of the following sets of values:

* {{async}} = [{{async}}, {{nrt}}] - Enables the NRT mode
* {{async}} = [{{async}}, {{sync}}] - Enables the sync mode
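
As a rough illustration of how these values could be interpreted, here is a 
minimal sketch. The {{HybridMode}} enum and its method are hypothetical and are 
not taken from the patch, and the precedence of {{sync}} over {{nrt}} when both 
are present is an assumption.

```java
import java.util.List;

// Hypothetical helper, not part of the patch: derives the hybrid mode
// from the values of the multi-valued "async" property.
enum HybridMode {
    NONE, NRT, SYNC;

    static HybridMode fromAsyncValues(List<String> asyncValues) {
        // Assumption: if both markers are present, sync takes precedence.
        if (asyncValues.contains("sync")) {
            return SYNC;
        }
        if (asyncValues.contains("nrt")) {
            return NRT;
        }
        // Plain async index, no local hybrid index.
        return NONE;
    }
}
```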

{{LuceneIndexProviderService}} - Provides some tuning configuration which can 
be modified as per setup requirements


h4. Implementation Detail

Most of the new code lives under the 
{{org.apache.jackrabbit.oak.plugins.index.lucene.hybrid}} package. For any 
commit involving an index definition marked with {{nrt}} or {{sync}}, 
{{LuceneIndexEditorProvider}} returns a {{LuceneIndexEditor}} backed by 
{{LocalIndexWriterFactory}}. This factory uses {{LocalIndexWriter}}, which 
stores the prepared {{LuceneDoc}} in {{LuceneDocumentHolder}}. This holder 
instance is stored as part of {{CommitContext}} (which is stored in the 
{{CommitInfo}} associated with the commit).

Once the merge is done for that commit, the change is picked up by 
{{LocalIndexObserver}} (a sync observer). This observer then looks for the 
{{LuceneDocumentHolder}} and, if found, processes the {{LuceneDoc}} instances 
stored in it:

* For documents belonging to {{nrt}} mode it adds the docs to 
{{DocumentQueue}}
* For documents belonging to {{sync}} mode it directly writes the document 
to the {{NRTIndex}} configured for that index

{{DocumentQueue}} asynchronously picks up the docs from the queue and then 
writes them to the index. Adding a doc to the queue can block for a short 
time, and if the queue remains full the doc is _dropped and not added to the 
queue_. So indexing here is on a best-effort basis.
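
The blocking-then-dropping behaviour can be sketched with a bounded queue and 
a timed offer. This is only an illustration: {{BestEffortQueue}}, its method 
names and the timeout parameter are made up here and do not mirror the actual 
{{DocumentQueue}} code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for DocumentQueue; names are illustrative only.
class BestEffortQueue<T> {
    private final BlockingQueue<T> queue;
    private final long offerTimeoutMillis;
    private final AtomicInteger dropped = new AtomicInteger();

    BestEffortQueue(int capacity, long offerTimeoutMillis) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.offerTimeoutMillis = offerTimeoutMillis;
    }

    // Waits at most offerTimeoutMillis for free space.
    // Returns false if the doc was dropped because the queue stayed full.
    boolean add(T doc) {
        boolean added;
        try {
            added = queue.offer(doc, offerTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            added = false;
        }
        if (!added) {
            dropped.incrementAndGet();
        }
        return added;
    }

    // Called by the background indexing thread; returns null when empty.
    T poll() {
        return queue.poll();
    }

    int getDroppedCount() {
        return dropped.get();
    }
}
```

A dropped doc is not lost for good in this design, since the next async 
indexing cycle still indexes the change.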

*NRTIndex*
On the indexing side each index (represented by {{IndexNode}}) has a matching 
{{NRTIndex}} which is constructed by {{NRTIndexFactory}}. Whenever a new 
{{IndexNode}} instance is created as a result of a change in the async index 
(via {{IndexTracker}}), the factory creates a new {{NRTIndex}} for it. It 
keeps a maximum of 2 instances of {{NRTIndex}} and closes and garbage collects 
older ones. So an {{NRTIndex}} only holds index data for the content indexed 
between 2 consecutive async indexing cycles.
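
The "keep two, close the rest" rule can be modelled as below. This is a sketch 
under assumptions: {{TwoGenerationFactory}} and {{Generation}} are hypothetical 
names, and a real {{NRTIndex}} would also release its Lucene resources on 
close.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: models a factory that keeps at most two live instances.
class TwoGenerationFactory {
    static class Generation {
        final int id;
        boolean closed;

        Generation(int id) {
            this.id = id;
        }

        void close() {
            closed = true;
        }
    }

    private final Deque<Generation> live = new ArrayDeque<>();
    private int nextId;

    // Called whenever the async index changes and a new index instance is opened.
    Generation createGeneration() {
        Generation g = new Generation(nextId++);
        live.addLast(g);
        // Close the oldest instances until only two remain.
        while (live.size() > 2) {
            live.removeFirst().close();
        }
        return g;
    }

    int liveCount() {
        return live.size();
    }
}
```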

{{NRTIndex}} provides access to the {{IndexWriter}} which is used by 
{{DocumentQueue}} to write documents to it. It also creates the {{IndexReader}}, 
which is obtained from the {{IndexWriter}} making use of [Lucene NRT 
Support|http://wiki.apache.org/lucene-java/NearRealtimeSearch]

{{NRTIndex}} also provides access to {{ReaderRefreshPolicy}} which determines 
how and when the reader should be refreshed. The policy instance is also made 
aware of the changes done to the index. For {{nrt}} indexes 
{{TimedRefreshPolicy}} is used, which by default refreshes the reader after a 
1 sec delay. For {{sync}} indexes {{RefreshOnWritePolicy}} is used, which 
refreshes the reader after any write.
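
The difference between the two policies can be sketched as follows. The 
{{RefreshPolicy}} interface and both classes here are hypothetical 
simplifications (single-threaded, with an injectable clock), so the real 
{{ReaderRefreshPolicy}} in the patch may be shaped differently.

```java
import java.util.function.LongSupplier;

// Hypothetical shape; the interface in the patch may differ.
interface RefreshPolicy {
    // Notified after a write to the index.
    void updated();

    // Asked before the reader is used; true means "reopen the reader now".
    boolean shouldRefresh();
}

// nrt: refresh only if something changed and the delay has elapsed.
class TimedPolicy implements RefreshPolicy {
    private final LongSupplier clockMillis;
    private final long refreshDeltaMillis;
    private boolean dirty;
    private long lastRefreshMillis;

    TimedPolicy(LongSupplier clockMillis, long refreshDeltaMillis) {
        this.clockMillis = clockMillis;
        this.refreshDeltaMillis = refreshDeltaMillis;
        this.lastRefreshMillis = clockMillis.getAsLong();
    }

    @Override
    public void updated() {
        dirty = true;
    }

    @Override
    public boolean shouldRefresh() {
        long now = clockMillis.getAsLong();
        if (dirty && now - lastRefreshMillis >= refreshDeltaMillis) {
            dirty = false;
            lastRefreshMillis = now;
            return true;
        }
        return false;
    }
}

// sync: refresh as soon as anything has been written.
class OnWritePolicy implements RefreshPolicy {
    private boolean dirty;

    @Override
    public void updated() {
        dirty = true;
    }

    @Override
    public boolean shouldRefresh() {
        boolean result = dirty;
        dirty = false;
        return result;
    }
}
```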

*Avoiding Deletes*

The indexing logic avoids deleting any document in the Lucene index. So if 
/a/b/@foo is updated, say, 3 times between 2 async index cycles

* /a/b/@foo = 'x'
* /a/b/@foo = 'y'
* /a/b/@foo = 'z'

Then the Lucene index would have 3 documents added (none updated). 
{{LucenePropertyIndex}} would then match any of the 3 depending on the query 
criteria. Say the query is for foo='x': {{LucenePropertyIndex}} would return 
/a/b as part of the Cursor. The cursor used is a unique cursor, so if Lucene 
returns three documents then only the first one results in a cursor entry and 
the others are ignored.

Later the query engine (QE) evaluates /a/b against the query criteria as per 
the {{ContentSession}} revision. If the node value at that time matches, the 
result is returned to the end user, otherwise it is skipped. So if per the 
current root NodeState /a/b/@foo='x', and for a query on foo='y' 
{{LucenePropertyIndex}} returns /a/b, then the QE filters out that result.

So in no case correctness of the result would get affected. This allows us to 
avoid deleting documents in Lucene index.
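
The two steps above (unique cursor, then re-check by the query engine) can be 
sketched with plain collections. This is a simplified model, not Oak code: 
{{AppendOnlyLookup}}, {{Entry}} and the map standing in for the current 
NodeState are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of an append-only index with query-time filtering.
class AppendOnlyLookup {
    // One index entry per write; older entries are never deleted.
    static class Entry {
        final String path;
        final String value;

        Entry(String path, String value) {
            this.path = path;
            this.value = value;
        }
    }

    // Step 1: index lookup. Several entries for one path collapse into a
    // single candidate, as with a unique cursor.
    static Set<String> candidatePaths(List<Entry> index, String value) {
        Set<String> paths = new LinkedHashSet<>();
        for (Entry e : index) {
            if (e.value.equals(value)) {
                paths.add(e.path);
            }
        }
        return paths;
    }

    // Step 2: the query engine re-checks each candidate against the current
    // state, so stale index entries never reach the result.
    static List<String> query(List<Entry> index,
                              Map<String, String> currentState,
                              String value) {
        List<String> result = new ArrayList<>();
        for (String path : candidatePaths(index, value)) {
            if (value.equals(currentState.get(path))) {
                result.add(path);
            }
        }
        return result;
    }
}
```

With entries for 'x', 'y' and 'z' on /a/b and a current value of 'z', a query 
for 'x' still finds /a/b as a candidate but the re-check drops it, while a 
query for 'z' returns /a/b.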

h3. C - Benchmark

A benchmark has been implemented in oak-run under {{HybridIndexTest}}. It 
creates multiple indexes (_numOfIndexes_ = 10) to simulate a system having 
multiple indexes defined and then creates nodes with property {{foo}} set to a 
value from the enum _Status_. Each thread creates nodes in breadth-first 
fashion (by default 5 child nodes per node, repeated for each child node). 

In addition there is a {{Searcher}} thread which queries for different values 
and a {{Mutator}} which modifies the values. The following settings are 
supported:
* refreshDeltaMillis - 1000 - Time delay between reader reopen for nrt
* asyncInterval - 5 - Time in seconds for async indexer
* queueSize - 1000 - Size of queue used by {{DocumentQueue}}
* hybridIndexEnabled - Boolean flag. If set to true hybrid index would be used 
otherwise property index would be used
* indexingMode - Defaults to nrt - [nrt/sync] - Which mode to use if 
hybridIndexEnabled
* useOakCodec - Boolean flag if set to true {{oakCodec}} would be used to avoid 
compression which slows down the searches (OAK-1737)

{noformat}
java  -DhybridIndexEnabled=true -DindexingMode=nrt -jar oak-run*.jar benchmark 
--concurrency=5 HybridIndexTest Oak-Mongo-FDS Oak-Segment-Tar-FDS
{noformat}

_Results will be posted soon_

h3. D - Pending Feature Work

* Support for listening to external changes and then updating the {{nrt}} 
indexes based on those changes
* JMX MBean around {{NRTIndexFactory}} to see rate of change etc



> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Chetan Mehrotra
>             Fix For: 1.6
>
>         Attachments: OAK-4412-v1.diff, OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After 
> performing some stress-tests with a geo-distributed Mongo cluster, we've 
> found out that updating property indexes is a large part of the overall 
> traffic.
> The asynchronous index would be an answer here (as the index update won't be 
> made in the client request thread), but the AEM requires the updates to be 
> visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a 
> synchronous, locally-stored counterpart that will persist only the data since 
> the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local 
> files. Once the "main" Lucene index is being updated, the local index will be 
> purged.
> Queries will use a union of results from the {{lucene}} and 
> {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a local stored entity, will be updated using 
> an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for the 
> OAK-4233.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
