indexing.md

thomasm Tue, 21 Mar 2017 10:05:05 -0700

Author: thomasm
Date: Tue Mar 21 17:04:34 2017
New Revision: 1788005

URL: http://svn.apache.org/viewvc?rev=1788005&view=rev
Log:
OAK-5946 - Document indexing flow (review)


Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md?rev=1788005&r1=1788004&r2=1788005&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md Tue Mar 21 17:04:34 2017
@@ -43,17 +43,20 @@
   
 ## <a name="overview"></a> Overview
   
-For queries to perform well Oak supports indexing content stored in 
repository. Indexing works
-on diff between the base NodeState and modified NodeState. Depending on how 
diff is performed and
-when the index content gets updated there are 3 types of indexing modes
+For queries to perform well, Oak supports indexing of content that is stored in the repository. 
+Indexing works by comparing different versions of the node data
+(technically, a `Diff` between the base `NodeState` and the modified `NodeState`). 
+There are three indexing modes, which define
+how the comparison is performed, and when the index content gets updated:
   
 1. Synchronous Indexing
 2. Asynchronous Indexing
-3. Near real time indexing
+3. Near Real Time (NRT) Indexing
 
-Indexing makes use of [Commit 
Editors](../architecture/nodestate.html#commit-editors). Some of the editors
-are `IndexEditor` which are responsible for updating index content based on 
changes in main content. Currently
-Oak has following in built `IndexEditor`s
+Indexing makes use of [Commit Editors](../architecture/nodestate.html#commit-editors). 
+Some of the editors are of type `IndexEditor`, which are responsible for updating index content 
+based on changes in main content. 
+Currently, Oak has the following built-in editors:
 
 1. PropertyIndexEditor
 2. ReferenceEditor
@@ -62,21 +65,24 @@ Oak has following in built `IndexEditor`
 
 ### <a name="new-1.6"></a> New in 1.6
 
-* [Near Real Time Indexing](#nrt-indexing)
+* [Near Real Time (NRT) Indexing](#nrt-indexing)
 * [Multiple Async indexers setup via OSGi config](#async-index-setup)
 * [Isolating Corrupt Indexes](#corrupt-index-handling)
 
 ## <a name="indexing-flow"></a> Indexing Flow
 
-`IndexEditor` are invoked as part of commit or as part of asynchronous diff 
process. For both cases at some stage
-diff is performed between _before_ and _after_ state and passed to 
`IndexUpdate` which is responsible for invoking
-`IndexEditor` based on _discovered_ index definitions.
+The `IndexEditor` is invoked as part of a commit (`Session.save()`), 
+or as part of the asynchronous "diff" process. 
+In both cases, at some stage a "diff" is performed between the _before_ and the _after_ state, 
+and passed to `IndexUpdate`, which is responsible for invoking the `IndexEditor`
+based on the _discovered_ index definitions.
 
 ### <a name="index-defnitions"></a> Index Definitions
 
-Index definitions are nodes of type `oak:QueryIndexDefinition` which are 
stored under a special node named `oak:index`.
-As part of diff traversal at each level `IndexUpdate` would look for 
`oak:index` nodes. Below is the canonical index 
-definition structure
+Index definitions are nodes of type `oak:QueryIndexDefinition`
+which are stored under a special node named `oak:index`.
+As part of diff traversal, at each level `IndexUpdate` looks for `oak:index` 
nodes. 
+Below is the canonical index definition structure:
 
     /oak:index/indexName
       - jcr:primaryType = "oak:QueryIndexDefinition"
@@ -84,85 +90,100 @@ definition structure
       - async (string) multiple
       - reindex (boolean)
       
+The index definition nodes have the following properties:
 
-The index definitions nodes have following properties
-
-1. `type` - It determines the _type_ of index. Based on the `type` 
`IndexUpdate` would look for `IndexEditor` of given 
-    type from registered `IndexEditorProvider`. For out of the box Oak setup 
it can have one of the following value
-    * `reference` -  Configured with out of box setup
-    * `counter` - Configured with out of box setup
+1. `type` - It determines the _type_ of index. Based on the `type`, 
+    `IndexUpdate` looks for an `IndexEditor` of the given 
+    type from the registered `IndexEditorProvider`. 
+    For an out-of-the-box Oak setup, it can have one of the following values:
+    * `reference` -  Configured with the out-of-the-box setup
+    * `counter` - Configured with the out-of-the-box setup
     * `property`
     * `lucene`
     * `solr`
-2. `async` - It determines if the index is to be updated synchronously or 
asynchronously. It can have following values
-    * `sync` - Also the default value. It indicates that index is meant to be 
updated as part of commit
+2. `async` - This determines if the index is to be updated synchronously or asynchronously. 
+    It can have the following values:
+    * `sync` - The default value. It indicates that the index is meant to be updated as part of each commit.
     * `nrt`  - Indicates that index is a [near real time](#nrt-indexing) 
index. 
-    * `async` - Indicates that index is to be updated asynchronously. In such 
a case this value is used to determine
+    * `async` - Indicates that the index is to be updated asynchronously. 
+       In such a case, this value is used to determine
        the [indexing lane](#indexing-lane)
     * Any other value which ends in `async`. 
-3. `reindex` - If set to `true` reindexing would be performed for that index. 
Post which the property would be removed.
+3. `reindex` - If set to `true`, reindexing is performed for that index. 
+    After reindexing is done, the property value is set to `false`.
     Refer to [reindexing](#reindexing) for more details.
     
-Based on above 2 properties `IndexUpdate` creates `IndexEditor` instances as 
it traverses the diff and registers them
-with itself passing on the callbacks for various changes
+Based on the above two properties, `IndexUpdate` creates `IndexEditor` instances 
+as it traverses the "diff", and registers them with itself, passing on the callbacks for various changes.
 
 #### <a name="oak-index-nodes"></a>oak:index node 
 
-Indexing logic supports placing `oak:index` nodes at any path. Depending on 
the location such indexes would only index
-content which are present under those paths. So for e.g. if 'oak:index' is 
present at _'/content/oak:index'_ then indexes
-defined under that node would only index repository state present under 
_'/content'_
-
-Depending on type of index one can create these index definitions under root 
path ('/') or non root paths. Currently 
-only `lucene` indexes support creating index definitions at non root paths. 
`property` indexes can only be created 
-under root path i.e. under '/'
+Indexing logic supports placing `oak:index` nodes at any path. 
+Depending on the location, such indexes only index content which is present under those paths. 
+So for example, if 'oak:index' is present at _'/content/oak:index'_, then indexes
+defined under that node only index repository data present under _'/content'_.
+
+Depending on the type of the index, one can create these index definitions under the root path ('/'), 
+or under non-root paths. 
+Currently, only `lucene` indexes support creating index definitions at non-root paths. 
+`property` indexes can only be created under the root path, that is, under '/'.
 
 ### <a name="sync-indexing"></a> Synchronous Indexing
 
-Under synchronous indexing the index content gets updates as part of commit 
itself. Changes to both index content
-and main content are done atomically in single commit. 
+Under synchronous indexing, the index content gets updated as part of the commit itself. 
+Changes to both the main content and the index content are done atomically in a single commit. 
 
-This mode is currently supported by `property` and `reference` indexes
+This mode is currently supported by `property` and `reference` indexes.
 
 ### <a name="async-indexing"></a> Asynchronous Indexing
 
-Asynchronous Indexing (also referred as async indexing) is performed using 
periodic scheduled jobs. As part of setup
-Oak would schedule certain periodic jobs which would perform diff of the 
repository content and update the index content
-based on that diff. 
-
-Each periodic job i.e. `AsyncIndexUpdate` is assigned to an [indexing 
lane](#indexing-lane) and is scheduled to run at 
-certain interval. At time of execution the job would perform work
-
-1. Look for last indexed state via stored checkpoint data. If such a 
checkpoint exist then resolve the `NodeState` for 
-   that checkpoint. If no such state exist or no such checkpoint is present 
then it treats it as initial indexing case where 
-   base state is set to empty. This state is considered as `before` state
-2. Create a checkpoint for _current_ state and refer to this as `after` state
-3. Create an `IndexUpdate` instance bound to current _indexing lane_ and 
trigger a diff between the `before` and
-   `after` state
-4. `IndexUpdate` would then pick up index definitions which are bound to 
current indexing lane and would create 
-   `IndexEditor` instances for them and pass them the diff callbacks
-5. The diff traverses in a depth first manner and at the end of diff the 
`IndexEditor` would do final changes for 
-   current indexing run. Depending on index implementation the index data can 
be either stored in NodeStore itself 
-   (e.g. lucene) or in any remote store (e.g. solr)
-6. `AsyncIndexUpdate` would then update the last indexed checkpoint to current 
checkpoint and do a commit. 
-
-Such async indexes are _eventually consistent_ with the repository state and 
lag behind the latest repository state
-by some time. However the index content would be eventually consistent and 
never end up in wrong state with respect
+Asynchronous indexing (also referred to as async indexing) is performed using periodic scheduled jobs. 
+As part of the setup, Oak schedules certain periodic jobs which perform 
+a diff of the repository content, and update the index content based on that diff. 
+
+Each periodic `AsyncIndexUpdate` job is assigned to an [indexing lane](#indexing-lane), 
+and is scheduled to run at a certain interval. 
+At the time of execution, the job performs the following work:
+
+1. Look for the last indexed state via stored checkpoint data. 
+   If such a checkpoint exists, then resolve the `NodeState` for that checkpoint. 
+   If no such state exists, or no such checkpoint is present, 
+   then it treats it as initial indexing, in which case the base state is empty. 
+   This state is considered the `before` state.
+2. Create a checkpoint for the _current_ state and refer to this as the `after` state.
+3. Create an `IndexUpdate` instance bound to the current _indexing lane_, 
+   and trigger a diff between the `before` and the `after` state.
+4. `IndexUpdate` will then pick up index definitions which are bound to the current indexing lane, 
+   will create `IndexEditor` instances for them, 
+   and pass them the diff callbacks.
+5. The diff traverses in a depth-first manner, 
+   and at the end of diff, the `IndexEditor` will do final changes for the current indexing run. 
+   Depending on the index implementation, the index data can be either stored in the NodeStore itself
+   (for indexes of type `lucene` and `property`), or in any remote store (for type `solr`).
+6. `AsyncIndexUpdate` will then update the last indexed checkpoint to the current checkpoint 
+   and do a commit. 
+
+Such async indexes are _eventually consistent_ with the repository state, 
+and lag behind the latest repository state by some time. 
+However, the index content is eventually consistent, and never ends up in a wrong state with respect
 to repository state.
 
 #### <a name="checkpoint"></a> Checkpoint
 
-Checkpoint is a mechanism whereby a client of NodeStore can request it to 
ensure that repository state at that time
-can be preserved and not garbage collected by revision garbage collection 
process. Later that state can be retrieved
-back from NodeStore by passing the checkpoint back. You can treat checkpoint 
like a named revision or a tag in git 
-repo.  
+A checkpoint is a mechanism whereby a client of `NodeStore` can request Oak to ensure 
+that the repository state (snapshot) at that time is preserved, and not garbage collected 
+by the revision garbage collection process. 
+Later, that state can be retrieved from the `NodeStore` by passing the checkpoint. 
+You can think of a checkpoint as a tag in a git repository, or as a named revision. 
 
 Async indexing makes use of checkpoint support to access older repository 
state. 
 
 #### <a name="indexing-lane"></a> Indexing Lane
 
-Indexing lane refers to a set of indexes which are to be indexed by given 
async indexer. Each index definition meant for
-async indexing defines an `async` property whose value is the name of indexing 
lane. For e.g. consider following 2 index
-definitions
+The term indexing lane refers to a set of indexes which are to be updated by a given async indexer.
+Each index definition meant for async indexing defines an `async` property, 
+whose value is the name of the indexing lane. 
+For example, consider the following two index definitions:
 
     /oak:index/userIndex
       - jcr:primaryType = "oak:QueryIndexDefinition"
@@ -172,116 +193,131 @@ definitions
       - jcr:primaryType = "oak:QueryIndexDefinition"
       - async = "fulltext-async"
       
-Here _userIndex_ is bound to "async" indexing lane while _assetIndex_ is bound 
to  "fulltext-async" lane. Oak 
-[setup](#async-index-setup) would configure 2 `AsyncIndexUpdate` jobs one for 
"async" and one for "fulltext-async".
-When job for "async" would run it would only process index definition where 
`async` value is `async` while when job
-for "fulltext-async" would run it would pick up index definitions where 
`async` value is `fulltext-async`.
-
-These jobs can be scheduled to run at different intervals and also on 
different cluster nodes. Each job would keep its
-own bookkeeping of checkpoint state and can be [paused and 
resumed](#async-index-mbean) separately.
-
-Prior to Oak 1.4 there was only one indexing lane `async`. In Oak 1.4 support 
was added to create 2 lanes `async` and 
-`fulltext-async`. With 1.6 its possible to [create multiple 
lanes](#async-index-setup). 
+Here, _userIndex_ is bound to the "async" indexing lane, 
+while _assetIndex_ is bound to the "fulltext-async" lane. 
+Oak [setup](#async-index-setup) configures two `AsyncIndexUpdate` jobs: 
+one for "async", and one for "fulltext-async".
+When the job for "async" is run, 
+it only processes index definitions where the `async` value is `async`, 
+while when the job for "fulltext-async" is run,
+it only picks up index definitions where the `async` value is `fulltext-async`.
+
+These jobs can be scheduled to run at different intervals, and also on 
different cluster nodes. 
+Each job keeps its own bookkeeping of checkpoint state, 
+and can be [paused and resumed](#async-index-mbean) separately.
+
+Prior to Oak 1.4, there was only one indexing lane: `async`. 
+In Oak 1.4, support was added to create two lanes: `async` and 
`fulltext-async`. 
+With 1.6, it is possible to [create multiple lanes](#async-index-setup). 
 
 #### <a name="cluster"></a> Clustered Setup
 
-In a clustered setup it needs to be ensured by the host application that async 
indexing jobs for specific lanes are to 
-be run as singleton in the cluster. If `AsyncIndexUpdate` for same lane gets 
executed concurrently on different cluster
-nodes then it can lead to race conditions where old checkpoint gets lost 
leading to reindexing of the indexes.
+In a clustered setup, the host application needs to ensure that 
+the async indexing job for a specific lane runs as a singleton in the cluster. 
+If `AsyncIndexUpdate` for the same lane gets executed concurrently on different cluster nodes,
+it can lead to race conditions, where an old checkpoint gets lost, 
+leading to reindexing of the indexes.
 
-Refer to [clustering](../clustering.html#scheduled-jobs) for more details on 
how the host application should schedule
-such indexing jobs
+See also [clustering](../clustering.html#scheduled-jobs) 
+for more details on how the host application should schedule such indexing 
jobs.
 
 ##### <a name="async-index-lease"></a> Indexing Lease
 
-`AsyncIndexUpdate` has an inbuilt lease logic to ensure that even if the jobs 
gets scheduled to run on different cluster
-nodes then also only one of them runs. This is done by keeping a lease 
property which gets periodically updated as 
+`AsyncIndexUpdate` has built-in "lease" logic to ensure that, 
+even if the jobs get scheduled to run on different cluster nodes, only one of them runs. 
+This is done by keeping a lease property, which gets periodically updated as 
 indexing progresses. 
 
-An `AsyncIndexUpdate` run would skip indexing if current lease has not expired 
i.e. if the last 
-update of lease was done long ago (default 15 mins) then it would be assumed 
that cluster node doing indexing is not 
-available and some other node would take over.
-
-The lease logic can delay start of indexing if the system is not stopped 
cleanly. As of Oak 1.6 this does not affect
-non clustered setup like those based on SegmentNodeStore but only [affects 
DocumentNodeStore][OAK-5159] based setups 
+An `AsyncIndexUpdate` run skips indexing if the current lease has not expired.
+If the last update of the lease was done long ago (default 15 mins), 
+then it is assumed that the cluster node doing the indexing is not available, 
+and some other node will take over.
+
+The lease logic can delay the start of indexing if the system is not stopped cleanly. 
+As of Oak 1.6, this does not affect non-clustered setups like those based on SegmentNodeStore,
+but only [affects DocumentNodeStore][OAK-5159] based setups.
 
 #### <a name="async-index-lag"></a> Indexing Lag
 
-Async indexing jobs are by default configured to run at interval of 5 secs. 
Depending on the system load and diff size
-of content to be indexed the indexing may start lagging by longer time 
intervals. Due to this the indexing results would
-lag behind the repository state and may become stale i.e. new content added 
would show up in result after some time.
+Async indexing jobs are by default configured to run at an interval of 5 seconds. 
+Depending on the system load and diff size of content to be indexed, 
+the indexing may start lagging by a longer time interval. 
+Due to this, the indexing results can lag behind the repository state, 
+and may become stale; that is, newly added content shows up in query results only after some time.
 
-`IndexStats` MBean keeps a time series and metrics stats for the indexing 
frequency. This can be used to track the 
-indexing state
+The `IndexStats` MBean keeps a time series and metrics stats for the indexing 
frequency. 
+This can be used to track the indexing state.
 
-[NRT Indexing](#nrt-indexing) introduced in Oak 1.6 would help in such 
situations and can keep the results more upto 
-date
+[NRT Indexing](#nrt-indexing) introduced in Oak 1.6 helps in such situations, 
+and can keep the results more up to date.
 
 #### <a name="async-index-setup"></a> Setup
 
 `@since Oak 1.6`
 
-Async indexers can be configure via OSGi config for 
`org.apache.jackrabbit.oak.plugins.index.AsyncIndexerService`
+Async indexers can be configured via the OSGi config for `org.apache.jackrabbit.oak.plugins.index.AsyncIndexerService`.
 
 ![Async Indexing Config](async-index-config.png)
 
-Different lanes can be configured by adding more rows of _Async Indexer 
Configs_. Prior to 1.6 the indexers were
-created programatically while constructing Oak.
+Different lanes can be configured by adding more rows of _Async Indexer Configs_. 
+Prior to 1.6, the indexers were created programmatically while constructing Oak.
 
 #### <a name="async-index-mbean"></a> Async Indexing MBean
 
-For each configured async indexer in the setup the indexer exposes a 
`IndexStatsMBean` which provides various
-stats around current indexing state. 
+For each configured async indexer in the setup, the indexer exposes an `IndexStatsMBean`, 
+which provides various stats around the current indexing state:
 
     org.apache.jackrabbit.oak: async (IndexStats)
     org.apache.jackrabbit.oak: fulltext-async (IndexStats)
 
 It provide details like
 
-* FailingIndexStats - Stats around indexes which are [failing and marked as 
corrupt](#corrupt-index-handling)
-* LastIndexedTime - Time upto which repository state has been indexed
-* Status - running, done, failing etc
-* Failing - boolean flag indicating that indexing has been failing due to some 
issue. This can be monitored
-  for detecting if indexer is healthy or not
-* ExecutionCount - Time series data around when number of execution for 
various time intervals
+* FailingIndexStats - Stats around indexes which are [failing and marked as 
corrupt](#corrupt-index-handling).
+* LastIndexedTime - Time up to which the repository state has been indexed.
+* Status - running, done, failing etc.
+* Failing - boolean flag indicating that indexing has been failing due to some issue. 
+  This can be monitored for detecting if the indexer is healthy or not.
+* ExecutionCount - Time series data around the number of runs for various time 
intervals.
 
 Further it provides operations like
 
-* pause - Pauses the indexer
-* abortAndPause - Aborts any running indexing cycle and pauses the indexer. 
Invoke 'resume' once you are ready 
-  to resume indexing again
-* resume - Resume the indexing
+* pause - Pauses the indexer.
+* abortAndPause - Aborts any running indexing cycle and pauses the indexer. 
+  Invoke 'resume' once you are ready to resume indexing again.
+* resume - Resumes indexing.
 
 #### <a name="corrupt-index-handling"></a> Isolating Corrupt Indexes
 
 `Since 1.6`
 
-AsyncIndexerService would now mark any index which fails to update for 30 mins 
(configurable) as `corrupt` and 
-ignore such indexes from further indexing. 
+The `AsyncIndexerService` marks any index which fails to update for 30 mins 
+(configurable) as `corrupt`, and excludes such indexes from further indexing. 
 
-When any index is marked as corrupt following log entry would be made
+When any index is marked as corrupt, the following log entry is made:
 
-    2016-11-22 12:52:35,484 INFO  NA [async-index-update-fulltext-async] 
o.a.j.o.p.i.AsyncIndexUpdate - Marking 
-    [/oak:index/lucene] as corrupt. The index is failing since Tue Nov 22 
12:51:25 IST 2016 ,1 indexing cycles, failed 
-    7 times, skipped 0 time 
+    2016-11-22 12:52:35,484 INFO  NA [async-index-update-fulltext-async] 
o.a.j.o.p.i.AsyncIndexUpdate - 
+    Marking [/oak:index/lucene] as corrupt. The index is failing since Tue Nov 
22 12:51:25 IST 2016, 
+    1 indexing cycles, failed 7 times, skipped 0 time 
 
-Post this when any new content gets indexed and any such corrupt index is 
skipped then following warn entry would be made
+After this, when any new content gets indexed and any such corrupt index is skipped, 
+the following warning entry is made:
 
-    2016-11-22 12:52:35,485 WARN  NA [async-index-update-fulltext-async] 
o.a.j.o.p.index.IndexUpdate - Ignoring corrupt 
-    index [/oak:index/lucene] which has been marked as corrupt since 
[2016-11-22T12:51:25.492+05:30]. This index MUST be 
-    reindexed for indexing to work properly 
+    2016-11-22 12:52:35,485 WARN  NA [async-index-update-fulltext-async] 
o.a.j.o.p.index.IndexUpdate - 
+    Ignoring corrupt index [/oak:index/lucene] which has been marked as 
corrupt since 
+    [2016-11-22T12:51:25.492+05:30]. This index MUST be reindexed for indexing 
to work properly 
     
-This info would also be seen in MBean
+This info is also seen in the MBean
 
 ![Corrupt Index stats in IndexStatsMBean](corrupt-index-mbean.png)
     
-Later once the index is reindexed following log entry would be made
+Later, once the index is reindexed, the following log entry is made
 
-    2016-11-22 12:56:25,486 INFO  NA [async-index-update-fulltext-async] 
o.a.j.o.p.index.IndexUpdate - Removing corrupt 
-    flag from index [/oak:index/lucene] which has been marked as corrupt since 
[corrupt = 2016-11-22T12:51:25.492+05:30] 
+    2016-11-22 12:56:25,486 INFO  NA [async-index-update-fulltext-async] 
o.a.j.o.p.index.IndexUpdate - 
+    Removing corrupt flag from index [/oak:index/lucene] which has been marked 
as corrupt since 
+    [corrupt = 2016-11-22T12:51:25.492+05:30] 
 
-This feature can be disabled by setting `failingIndexTimeoutSeconds` to 0 in 
AsyncIndexService config. Refer to 
-[OAK-4939][OAK-4939] for more details
+This feature can be disabled by setting `failingIndexTimeoutSeconds` to 0 in the `AsyncIndexerService` config. 
+See also [OAK-4939][OAK-4939] for more details.
 
 ### <a name="nrt-indexing"></a> Near Real Time Indexing
 
@@ -289,61 +325,66 @@ This feature can be disabled by setting
 
 _This mode is only supported for `lucene` indexes_
 
-Lucene indexes perform well for evaluating complex queries and also have the 
benefit of being evaluated locally with
-copy-on-read support. However they are `async` index and depending on system 
load can lag behind the repository state.
-For cases where such lag (of order of minutes) is not acceptable one has to 
use `property` indexes. For such cases
-Oak 1.6 has [added support for near real time indexing][OAK-4412]
+Lucene indexes perform well for evaluating complex queries, 
+and also have the benefit of being evaluated locally with copy-on-read support. 
+However, they are `async`, and depending on system load can lag behind the repository state.
+For cases where such lag (in the order of minutes) is not acceptable, 
+one has to use `property` indexes. 
+For such cases, Oak 1.6 has [added support for near real time indexing][OAK-4412].
 
 ![NRT Index Flow](index-nrt.png)
 
-In this mode the indexing would happen in 2 modes and query would consult 
multiple indexes. The diagram above shows
-indexing flow with time. In above flow
+In this mode, the indexing happens in two modes, and a query will consult multiple indexes. 
+The diagram above shows the indexing flow with time. In the above flow,
 
 * T1, T3 and T5 - Time instances at which checkpoint is created
-* T2 and T4 - Time instance when async indexer run completed and indexes were 
updated
+* T2 and T4 - Time instances when async indexer runs completed and indexes were updated
 * Persisted Index 
-    * v2 - Index version v2 which has repository state upto time T1 indexed
-    * v3 - Index version v2 which has repository state upto time T3 indexed
+    * v2 - Index version v2, which has repository state up to time T1 indexed
+    * v3 - Index version v3, which has repository state up to time T3 indexed
 * Local Index
-    * NRT1 - Local index which repository state between time T2 and T4 indexed
-    * NRT2 - Local index which repository state between time T4 and T6 indexed
+    * NRT1 - Local index, which has repository state between time T2 and T4 indexed
+    * NRT2 - Local index, which has repository state between time T4 and T6 indexed
     
-As repository state changes with time Async indexer would run and index state 
between last known checkpoint and 
-current state when that run started. So when asyn run 1 completed the 
persisted index has repository state indexed
-upto time T3.
-
-Now without NRT index support if any query is performed between time T2 and T4 
it would only see index result for
-repository state at time T1 as thats state which the persisted indexes have 
data for. Any change after that would not be
-seen untill next async indexing cycle complete (by time T4). 
-
-With NRT indexing support indexing would happen at 2 places
-
-* Persisted Index - This is the index which is updated via async indexer run. 
This flow would remain same i.e. it 
-  would be periodically updated by the indexer run
-* Local Index - In addition to persisted index each cluster node would also 
maintain a local index. This index would 
-  only keep data between 2 async indexer run. Post each run the previous index 
would be discarded and a new index would
-  be built (actually previous index is retained for one cycle)
+As the repository state changes with time, the async indexer will run and index the 
+state between last known checkpoint and current state when that run started. 
+So when async run 1 completed, the persisted index has the repository state indexed up to time T3.
+
+Now without NRT index support, if any query is performed between time T2 and T4, 
+it can only see index result for repository state at time T1, 
+as that is the state for which the persisted indexes have data. 
+Any change after that cannot be seen until the next async indexing cycle is complete (by time T4). 
+
+With NRT indexing support, indexing happens in two places:
+
+* Persisted Index - This is the index which is updated via the async indexer run. 
+  This flow remains the same: it is periodically updated by the indexer run.
+* Local Index - In addition to the persisted index, each cluster node also maintains a local index. 
+  This index only keeps data between two async indexer runs. 
+  After each run, the previous index is discarded, and a new index is built 
+  (actually the previous index is retained for one cycle).
   
-Any query making use of such an index would make use of both indexes. With 
this new content added in repository
-after the last async index run would also show up quickly. 
+Any query making use of such an index will automatically make use of both the 
persisted and the local indexes. 
+With this, new content added in the repository after the last async index run 
will also show up quickly.
 
 #### <a name="nrt-indexing-modes"></a> NRT Indexing Modes
 
-NRT indexing can be enabled for any index by configuring the `async` property
+NRT (Near real time) indexing can be enabled for any index by configuring the 
`async` property:
 
     /oak:index/assetIndex
       - jcr:primaryType = "oak:QueryIndexDefinition"
       - async = ['fulltext-async', 'nrt']
       
-Here `async` value has been set to a multi value property where 
+Here, the `async` property is multi-valued, and consists of the
 
-* Indexing lane - Like `async` or `fulltext-async`
-* NRT Indexing Mode - `nrt` or `sync`
+* Indexing lane - For example `async` or `fulltext-async`,
+* NRT Indexing Mode - `nrt` or `sync`.
 
 ##### <a name="nrt-indexing-mode-nrt"></a> nrt
 
-In this mode the local index would be updated asynchronously on that cluster 
nodes post commit and the index reader 
-would be refreshed after 1 sec. So any change done should should show up on 
that cluster node in 1-2 secs
+In this mode, the local index is updated asynchronously on that cluster node after each commit, 
+and the index reader is refreshed each second. 
+So any change should show up on that cluster node within 1 to 2 seconds.
 
     /oak:index/userIndex
       - jcr:primaryType = "oak:QueryIndexDefinition"
@@ -351,73 +392,81 @@ would be refreshed after 1 sec. So any c
 
 ##### <a name="nrt-indexing-mode-sync"></a> sync
 
-In this mode the local index would be updated synchronously on that cluster 
nodes post commit and the index reader 
-would be refreshed immediately. This mode performs slowly compared to the 
"nrt" mode
+In this mode, the local index is updated synchronously on that cluster node after each commit,
+and the index reader is refreshed immediately. 
+This mode performs more slowly compared to the "nrt" mode.
 
     /oak:index/userIndex
       - jcr:primaryType = "oak:QueryIndexDefinition"
       - async = ['async', 'sync']
       
-For a single node setup (like with SegmentNodeStore) this mode effectively 
makes async lucene index perform same as 
-synchronous property indexes. However 'nrt' mode performs better so using that 
would be preferable
-      
+For a single node setup (for example with the `SegmentNodeStore`),
+this mode effectively makes an async Lucene index perform the same as synchronous property indexes.
+However, the 'nrt' mode performs better, so using that is preferable.
+
 #### <a name="nrt-indexing-cluster-setup"></a> Cluster Setup
 
-In cluster setup each cluster node would maintain its own local index for 
changes happening in that cluster node.
-In addition to that it would also index changes from other cluster node by 
relying on [Oak observation for external 
-changes][OAK-4808]. This depends on how frequently external changes are 
delivered. Due to this even with NRT indexing
-changes from other cluster node would take some more time to reflect in query 
result compared to local changes.
+In a cluster setup, each cluster node maintains its own local index for changes happening in that cluster node.
+In addition to that, it also indexes changes from other cluster nodes by relying on
+[Oak observation for external changes][OAK-4808].
+This depends on how frequently external changes are delivered.
+Due to this, even with NRT indexing, changes from other cluster nodes will take some more time
+to be reflected in query results compared to local changes.
 
 #### <a name="nrt-indexing-config"></a> Configuration
 
-NRT indexing expose few configuration options as part of 
[LuceneIndexProviderService](lucene.html#osgi-config)
+NRT indexing exposes a few configuration options as part of the [LuceneIndexProviderService](lucene.html#osgi-config):
+
+* `enableHybridIndexing` - Boolean property, defaults to `true`. 
+  Can be set to `false` to disable the NRT indexing feature completely.
+* `hybridQueueSize` - The size of the in-memory queue used
+  to hold Lucene documents for indexing in the `nrt` mode.
+  The default size is 10000.
 
-* `enableHybridIndexing` - Boolean property defaults to `true`. Can be set to 
`false` to disable NRT indexing feature 
-  completely
-* `hybridQueueSize` - Size of in memory queue used to hold Lucene documents 
for indexing in `nrt` mode. Default size is
-  10000
-  
 ## <a name="reindexing"></a> Reindexing
 
-Reindexing of existing indexes is required in following scenarios
+Reindexing of existing indexes is required in the following scenarios:
 
-* Incompatible change in index definition - For example adding properties to 
the index which is already
-  present in repository
-* Corrupted Index - If the index is corrupt and `AsyncIndexUpdate` run fails 
with exception pointing to index being 
-  corrupt
+* Incompatible changes in the index definition -
+  For example, adding properties to an index that is already
+  present in the repository.
+* Corrupted index - If the index is corrupt and the `AsyncIndexUpdate` run fails
+  with an exception pointing to the index being corrupt.
   
-Reindexing does not resolve other problems, such that queries not returning 
data. For such cases, it is _not_ 
-recommended to reindex (also because this can be very slow and use a lot of 
temporary disk space).
+Reindexing does not resolve other problems, such as queries not returning data.
+For such cases, it is _not_ recommended to reindex (also because this can be very slow and use a lot of temporary disk space).
 If queries don't return the right data, then possibly the index is [not yet up-to-date][OAK-5159],
-or the query is incorrect, or included/excluded path settings are wrong (for 
Lucene indexes). Instead of reindexing, it 
-is suggested to first check the log file, modify the query so it uses a 
different index or traversal and run the query again.
+or the query is incorrect, or included/excluded path settings are wrong (for Lucene indexes).
+Instead of reindexing, it is suggested to first check the log file,
+modify the query so it uses a different index or traversal, and run the query again.
 One case where reindexing can help is if the query engine picks a very slow index for some queries because the counter index
 [got out of sync after adding and removing lots of nodes many times (fixed in recent version)][OAK-4065].
 For this case, it is recommended to verify the contents of the counter index first,
 and upgrade Oak before reindexing.
 
-Also note that with Oak 1.6 for Lucene indexes changes in index definition are 
only effective 
-[post reindexing](lucene.html#stored-index-definition)
+Also note that with Oak 1.6, for Lucene indexes, changes in the index definition are only effective
+[post reindexing](lucene.html#stored-index-definition).
 
-To reindex any index set the `reindex` flag to `true` in index definition
+To reindex any index, set the `reindex` flag to `true` in the index definition:
 
     /oak:index/userIndex
       - jcr:primaryType = "oak:QueryIndexDefinition"
       - async = ['async']
       - reindex = true
       
-Once changes are saved the index would be reindexed. For synchronous indexes 
the reindexing would be done
-as part of save (or commit) itself. While for asynchronous indexes are 
reindexed whenever the next async 
-indexing cycle run happens. Once reindexing starts following log entries can 
be seen in the log
+Once changes are saved, the index is reindexed. For synchronous indexes,
+the reindexing is done as part of the save (or commit) itself.
+For asynchronous indexes, reindexing starts with the next async indexing cycle.
+Once reindexing starts, the following log entries can be seen in the log:
 
     [async-index-update-async] o.a.j.o.p.i.IndexUpdate Reindexing will be performed for following indexes: [/oak:index/userIndex]
     [async-index-update-async] o.a.j.o.p.i.IndexUpdate Reindexing Traversed #100000 /home/user/admin
     [async-index-update-async] o.a.j.o.p.i.AsyncIndexUpdate [async] Reindexing completed for indexes: [/oak:index/userIndex*(4407016)] in 30 min
-    
-In both cases once reindexing is complete the `reindex` flag would be removed.
 
-For property index you can also make use of `PropertyIndexAsyncReindexMBean`. 
Refer to 
-[reindeinxing property indexes](property-index.html#reindexing) section for 
more details on that
+In both cases, once reindexing is complete, the `reindex` flag is removed.
+
+For a property index, you can also make use of the `PropertyIndexAsyncReindexMBean`.
+See also the [reindexing property indexes](property-index.html#reindexing) section for more details on that.
 
 
 [OAK-5159]: https://issues.apache.org/jira/browse/OAK-5159
