Author: thomasm
Date: Tue Mar 21 17:04:34 2017
New Revision: 1788005
URL: http://svn.apache.org/viewvc?rev=1788005&view=rev
Log:
OAK-5946 - Document indexing flow (review)
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md
URL:
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md?rev=1788005&r1=1788004&r2=1788005&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/indexing.md Tue Mar 21 17:04:34 2017
@@ -43,17 +43,20 @@
## <a name="overview"></a> Overview
-For queries to perform well Oak supports indexing content stored in
repository. Indexing works
-on diff between the base NodeState and modified NodeState. Depending on how
diff is performed and
-when the index content gets updated there are 3 types of indexing modes
+For queries to perform well, Oak supports indexing of content that is stored
in the repository.
+Indexing works on comparing different versions of the node data
+(technically, `Diff` between the base `NodeState` and the modified
`NodeState`).
+There are indexing modes that define
+how comparing is performed, and when the index content gets updated:
1. Synchronous Indexing
2. Asynchronous Indexing
-3. Near real time indexing
+3. Near Real Time (NRT) Indexing
-Indexing makes use of [Commit
Editors](../architecture/nodestate.html#commit-editors). Some of the editors
-are `IndexEditor` which are responsible for updating index content based on
changes in main content. Currently
-Oak has following in built `IndexEditor`s
+Indexing makes use of [Commit
Editors](../architecture/nodestate.html#commit-editors).
+Some of the editors are of type `IndexEditor`, which are responsible for
updating index content
+based on changes in main content.
+Currently, Oak has the following built-in editors:
1. PropertyIndexEditor
2. ReferenceEditor
@@ -62,21 +65,24 @@ Oak has following in built `IndexEditor`
### <a name="new-1.6"></a> New in 1.6
-* [Near Real Time Indexing](#nrt-indexing)
+* [Near Real Time (NRT) Indexing](#nrt-indexing)
* [Multiple Async indexers setup via OSGi config](#async-index-setup)
* [Isolating Corrupt Indexes](#corrupt-index-handling)
## <a name="indexing-flow"></a> Indexing Flow
-`IndexEditor` are invoked as part of commit or as part of asynchronous diff
process. For both cases at some stage
-diff is performed between _before_ and _after_ state and passed to
`IndexUpdate` which is responsible for invoking
-`IndexEditor` based on _discovered_ index definitions.
+The `IndexEditor` is invoked as part of a commit (`Session.save()`),
+or as part of the asynchronous "diff" process.
+For both cases, at some stage "diff" is performed between the _before_ and the
_after_ state,
+and passed to `IndexUpdate`, which is responsible for invoking the
`IndexEditor`
+based on the _discovered_ index definitions.
### <a name="index-defnitions"></a> Index Definitions
-Index definitions are nodes of type `oak:QueryIndexDefinition` which are
stored under a special node named `oak:index`.
-As part of diff traversal at each level `IndexUpdate` would look for
`oak:index` nodes. Below is the canonical index
-definition structure
+Index definitions are nodes of type `oak:QueryIndexDefinition`
+which are stored under a special node named `oak:index`.
+As part of diff traversal, at each level `IndexUpdate` looks for `oak:index`
nodes.
+Below is the canonical index definition structure:
/oak:index/indexName
- jcr:primaryType = "oak:QueryIndexDefinition"
@@ -84,85 +90,100 @@ definition structure
- async (string) multiple
- reindex (boolean)
+The index definition nodes have the following properties:
-The index definitions nodes have following properties
-
-1. `type` - It determines the _type_ of index. Based on the `type`
`IndexUpdate` would look for `IndexEditor` of given
- type from registered `IndexEditorProvider`. For out of the box Oak setup
it can have one of the following value
- * `reference` - Configured with out of box setup
- * `counter` - Configured with out of box setup
+1. `type` - It determines the _type_ of index. Based on the `type`,
+ `IndexUpdate` looks for an `IndexEditor` of the given
+ type from the registered `IndexEditorProvider`.
+ For an out-of-the-box Oak setup, it can have one of the following values:
+ * `reference` - Configured with the out-of-the-box setup
+ * `counter` - Configured with the out-of-the-box setup
* `property`
* `lucene`
* `solr`
-2. `async` - It determines if the index is to be updated synchronously or
asynchronously. It can have following values
- * `sync` - Also the default value. It indicates that index is meant to be
updated as part of commit
+2. `async` - This determines if the index is to be updated synchronously or asynchronously.
+ It can have the following values:
+ * `sync` - The default value. It indicates that the index is meant to be updated as part of each commit.
* `nrt` - Indicates that index is a [near real time](#nrt-indexing)
index.
- * `async` - Indicates that index is to be updated asynchronously. In such
a case this value is used to determine
+ * `async` - Indicates that the index is to be updated asynchronously.
+ In such a case, this value is used to determine
the [indexing lane](#indexing-lane)
* Any other value which ends in `async`.
-3. `reindex` - If set to `true` reindexing would be performed for that index.
Post which the property would be removed.
+3. `reindex` - If set to `true`, reindexing is performed for that index.
+ After reindexing is done, the property value is set to `false`.
Refer to [reindexing](#reindexing) for more details.
-Based on above 2 properties `IndexUpdate` creates `IndexEditor` instances as
it traverses the diff and registers them
-with itself passing on the callbacks for various changes
+Based on the `type` and `async` properties, `IndexUpdate` creates `IndexEditor` instances
+as it traverses the "diff", and registers them with itself, passing on the callbacks for various changes.
#### <a name="oak-index-nodes"></a>oak:index node
-Indexing logic supports placing `oak:index` nodes at any path. Depending on
the location such indexes would only index
-content which are present under those paths. So for e.g. if 'oak:index' is
present at _'/content/oak:index'_ then indexes
-defined under that node would only index repository state present under
_'/content'_
-
-Depending on type of index one can create these index definitions under root
path ('/') or non root paths. Currently
-only `lucene` indexes support creating index definitions at non root paths.
`property` indexes can only be created
-under root path i.e. under '/'
+Indexing logic supports placing `oak:index` nodes at any path.
+Depending on the location, such indexes only index content which is present under those paths.
+So for example, if 'oak:index' is present at _'/content/oak:index'_, then
indexes
+defined under that node only index repository data present under _'/content'_.
+
+Depending on the type of the index, one can create these index definitions under the root path ('/'),
+or under non-root paths.
+Currently, only `lucene` indexes support creating index definitions at non-root paths.
+`property` indexes can only be created under the root path, that is, under '/'.
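
The scoping rule above can be illustrated with a small sketch. This is not Oak code: `index_scope` and `covers` are hypothetical helper names, and the sketch only models the rule that an index definition covers the subtree that contains its `oak:index` parent node.

```python
import posixpath


def index_scope(index_definition_path: str) -> str:
    """Return the subtree covered by an index definition (illustrative model).

    '/content/oak:index/myIndex' covers '/content',
    '/oak:index/myIndex' covers '/'.
    """
    container = posixpath.dirname(index_definition_path)  # the oak:index node
    if posixpath.basename(container) != "oak:index":
        raise ValueError("index definitions must be stored under an oak:index node")
    return posixpath.dirname(container)


def covers(index_definition_path: str, content_path: str) -> bool:
    """True if content_path lies inside the subtree covered by the index."""
    scope = index_scope(index_definition_path)
    if scope == "/":
        return True
    return content_path == scope or content_path.startswith(scope + "/")
```

For example, `covers("/content/oak:index/assetIndex", "/content/dam/a.png")` is true, while the same index does not cover `/etc/config`.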
### <a name="sync-indexing"></a> Synchronous Indexing
-Under synchronous indexing the index content gets updates as part of commit
itself. Changes to both index content
-and main content are done atomically in single commit.
+Under synchronous indexing, the index content gets updated as part of the commit itself.
+Changes to both the main content, as well as the index content, are done
atomically in a single commit.
-This mode is currently supported by `property` and `reference` indexes
+This mode is currently supported by `property` and `reference` indexes.
### <a name="async-indexing"></a> Asynchronous Indexing
-Asynchronous Indexing (also referred as async indexing) is performed using
periodic scheduled jobs. As part of setup
-Oak would schedule certain periodic jobs which would perform diff of the
repository content and update the index content
-based on that diff.
-
-Each periodic job i.e. `AsyncIndexUpdate` is assigned to an [indexing
lane](#indexing-lane) and is scheduled to run at
-certain interval. At time of execution the job would perform work
-
-1. Look for last indexed state via stored checkpoint data. If such a
checkpoint exist then resolve the `NodeState` for
- that checkpoint. If no such state exist or no such checkpoint is present
then it treats it as initial indexing case where
- base state is set to empty. This state is considered as `before` state
-2. Create a checkpoint for _current_ state and refer to this as `after` state
-3. Create an `IndexUpdate` instance bound to current _indexing lane_ and
trigger a diff between the `before` and
- `after` state
-4. `IndexUpdate` would then pick up index definitions which are bound to
current indexing lane and would create
- `IndexEditor` instances for them and pass them the diff callbacks
-5. The diff traverses in a depth first manner and at the end of diff the
`IndexEditor` would do final changes for
- current indexing run. Depending on index implementation the index data can
be either stored in NodeStore itself
- (e.g. lucene) or in any remote store (e.g. solr)
-6. `AsyncIndexUpdate` would then update the last indexed checkpoint to current
checkpoint and do a commit.
-
-Such async indexes are _eventually consistent_ with the repository state and
lag behind the latest repository state
-by some time. However the index content would be eventually consistent and
never end up in wrong state with respect
+Asynchronous indexing (also referred to as async indexing) is performed using periodic scheduled jobs.
+As part of the setup, Oak schedules certain periodic jobs which perform
+a diff of the repository content, and update the index content based on that diff.
+
+Each periodic `AsyncIndexUpdate` job is assigned to an [indexing lane](#indexing-lane),
+and is scheduled to run at a certain interval.
+At the time of execution, the job performs the following work:
+
+1. Look for the last indexed state via stored checkpoint data.
+ If such a checkpoint exists, then resolve the `NodeState` for that checkpoint.
+ If no such state exists, or no such checkpoint is present,
+ then it treats it as initial indexing, in which case the base state is empty.
+ This state is considered the `before` state.
+2. Create a checkpoint for _current_ state and refer to this as `after` state.
+3. Create an `IndexUpdate` instance bound to the current _indexing lane_,
+ and trigger a diff between the `before` and the `after` state.
+4. `IndexUpdate` will then pick up index definitions which are bound to the
current indexing lane,
+ will create `IndexEditor` instances for them,
+ and pass them the diff callbacks.
+5. The diff traverses in a depth-first manner,
+ and at the end of diff, the `IndexEditor` will do final changes for the
current indexing run.
+ Depending on the index implementation, the index data can be either stored
in NodeStore itself
+ (for indexes of type `lucene` and `property`), or in any remote store (for
type `solr`).
+6. `AsyncIndexUpdate` will then update the last indexed checkpoint to the
current checkpoint
+ and do a commit.
+
+Such async indexes are _eventually consistent_ with the repository state,
+and lag behind the latest repository state by some time.
+However, the index content is eventually consistent, and never ends up in a wrong state with respect
to repository state.
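
The six steps above can be modelled in a few lines. This is a simplified sketch, not the actual `AsyncIndexUpdate` implementation: the repository is assumed to be a plain dictionary from path to (non-`None`) value, a "checkpoint" is just a copy of that dictionary, and `AsyncIndexerModel` is a hypothetical name.

```python
def diff(before: dict, after: dict) -> dict:
    """Return {path: (old, new)} for every added, changed or removed path."""
    changes = {}
    for path in before.keys() | after.keys():
        old, new = before.get(path), after.get(path)
        if old != new:
            changes[path] = (old, new)
    return changes


class AsyncIndexerModel:
    """Toy model of one async indexer bound to one indexing lane."""

    def __init__(self, lane: str):
        self.lane = lane
        self.checkpoint = None  # last indexed snapshot, None = nothing indexed yet
        self.index = {}         # path -> indexed value

    def run(self, repository: dict) -> int:
        # Step 1: resolve the "before" state (empty for initial indexing).
        before = self.checkpoint if self.checkpoint is not None else {}
        # Step 2: create a checkpoint of the current state ("after").
        after = dict(repository)
        # Steps 3 to 5: diff both states and apply the changes to the index.
        changes = diff(before, after)
        for path, (_, new) in changes.items():
            if new is None:
                self.index.pop(path, None)  # path was removed
            else:
                self.index[path] = new
        # Step 6: remember the new checkpoint for the next run.
        self.checkpoint = after
        return len(changes)
```

Between two calls of `run`, the index lags behind the repository, but after each run it matches the snapshot taken in step 2, which is the eventual consistency described above.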
#### <a name="checkpoint"></a> Checkpoint
-Checkpoint is a mechanism whereby a client of NodeStore can request it to
ensure that repository state at that time
-can be preserved and not garbage collected by revision garbage collection
process. Later that state can be retrieved
-back from NodeStore by passing the checkpoint back. You can treat checkpoint
like a named revision or a tag in git
-repo.
+A checkpoint is a mechanism, whereby a client of `NodeStore` can request Oak
to ensure
+that the repository state (snapshot) at that time can be preserved, and not
garbage collected
+by the revision garbage collection process.
+Later, that state can be retrieved from the NodeStore by passing the
checkpoint.
+You can think of a checkpoint as a tag in a git repository, or as a named revision.
Async indexing makes use of checkpoint support to access older repository
state.
#### <a name="indexing-lane"></a> Indexing Lane
-Indexing lane refers to a set of indexes which are to be indexed by given
async indexer. Each index definition meant for
-async indexing defines an `async` property whose value is the name of indexing
lane. For e.g. consider following 2 index
-definitions
+The term indexing lane refers to a set of indexes which are to be updated by a
given async indexer.
+Each index definition meant for async indexing defines an `async` property,
+whose value is the name of the indexing lane.
+For example, consider following 2 index definitions:
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
@@ -172,116 +193,131 @@ definitions
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = "fulltext-async"
-Here _userIndex_ is bound to "async" indexing lane while _assetIndex_ is bound
to "fulltext-async" lane. Oak
-[setup](#async-index-setup) would configure 2 `AsyncIndexUpdate` jobs one for
"async" and one for "fulltext-async".
-When job for "async" would run it would only process index definition where
`async` value is `async` while when job
-for "fulltext-async" would run it would pick up index definitions where
`async` value is `fulltext-async`.
-
-These jobs can be scheduled to run at different intervals and also on
different cluster nodes. Each job would keep its
-own bookkeeping of checkpoint state and can be [paused and
resumed](#async-index-mbean) separately.
-
-Prior to Oak 1.4 there was only one indexing lane `async`. In Oak 1.4 support
was added to create 2 lanes `async` and
-`fulltext-async`. With 1.6 its possible to [create multiple
lanes](#async-index-setup).
+Here, _userIndex_ is bound to the "async" indexing lane,
+while _assetIndex_ is bound to the "fulltext-async" lane.
+Oak [setup](#async-index-setup) configures two `AsyncIndexUpdate` jobs:
+one for "async", and one for "fulltext-async".
+When the job for "async" is run,
+it only processes index definitions where the `async` value is `async`,
+while when the job for "fulltext-async" is run,
+it only picks up index definitions where the `async` value is `fulltext-async`.
+
+These jobs can be scheduled to run at different intervals, and also on
different cluster nodes.
+Each job keeps its own bookkeeping of checkpoint state,
+and can be [paused and resumed](#async-index-mbean) separately.
+
+Prior to Oak 1.4, there was only one indexing lane: `async`.
+In Oak 1.4, support was added to create two lanes: `async` and
`fulltext-async`.
+With 1.6, it is possible to [create multiple lanes](#async-index-setup).
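
How a lane job selects its index definitions can be sketched as follows. This is an illustrative model, not Oak code: `lane_of` and `definitions_for_lane` are hypothetical names, and the only rule taken from the text is that the lane is the `async` value that ends in `async`.

```python
def lane_of(async_values):
    """Return the indexing lane of an index definition, or None if it has none.

    async_values is the (possibly empty) list of values of the `async` property.
    """
    for value in async_values:
        if value.endswith("async"):
            return value
    return None


def definitions_for_lane(definitions, lane):
    """Names of the index definitions a job for the given lane would process."""
    return sorted(name for name, values in definitions.items()
                  if lane_of(values) == lane)


definitions = {
    "userIndex": ["async"],
    "assetIndex": ["fulltext-async"],
    "titleIndex": [],  # no async property: updated synchronously
}
```

With these definitions, the job for the `async` lane processes only `userIndex`, and the job for `fulltext-async` processes only `assetIndex`.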
#### <a name="cluster"></a> Clustered Setup
-In a clustered setup it needs to be ensured by the host application that async
indexing jobs for specific lanes are to
-be run as singleton in the cluster. If `AsyncIndexUpdate` for same lane gets
executed concurrently on different cluster
-nodes then it can lead to race conditions where old checkpoint gets lost
leading to reindexing of the indexes.
+In a clustered setup, the host application needs to ensure that
+the async indexing job for each lane runs as a singleton in the cluster.
+If `AsyncIndexUpdate` for the same lane gets executed concurrently on different cluster nodes,
+it can lead to race conditions, where an old checkpoint gets lost,
+leading to reindexing of the indexes.
-Refer to [clustering](../clustering.html#scheduled-jobs) for more details on
how the host application should schedule
-such indexing jobs
+See also [clustering](../clustering.html#scheduled-jobs)
+for more details on how the host application should schedule such indexing
jobs.
##### <a name="async-index-lease"></a> Indexing Lease
-`AsyncIndexUpdate` has an inbuilt lease logic to ensure that even if the jobs
gets scheduled to run on different cluster
-nodes then also only one of them runs. This is done by keeping a lease
property which gets periodically updated as
+`AsyncIndexUpdate` has built-in "lease" logic to ensure that
+even if the jobs get scheduled to run on different cluster nodes, only one of them runs.
+This is done by keeping a lease property, which gets periodically updated as
indexing progresses.
-An `AsyncIndexUpdate` run would skip indexing if current lease has not expired
i.e. if the last
-update of lease was done long ago (default 15 mins) then it would be assumed
that cluster node doing indexing is not
-available and some other node would take over.
-
-The lease logic can delay start of indexing if the system is not stopped
cleanly. As of Oak 1.6 this does not affect
-non clustered setup like those based on SegmentNodeStore but only [affects
DocumentNodeStore][OAK-5159] based setups
+An `AsyncIndexUpdate` run skips indexing if the current lease has not expired.
+If the last update of the lease was done long ago (default 15 mins),
+then it is assumed that the cluster node doing the indexing is not available,
+and some other node will take over.
+
+The lease logic can delay the start of indexing if the system is not stopped
cleanly.
+As of Oak 1.6, this does not affect non-clustered setups like those based on SegmentNodeStore,
+but only [affects DocumentNodeStore][OAK-5159] based setups.
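
The lease check can be sketched as below. This is a minimal model under stated assumptions, not the actual Oak logic: times are plain seconds, `should_skip_run` is a hypothetical name, and only the 15 minute default is taken from the text.

```python
LEASE_TIMEOUT_SECONDS = 15 * 60  # default lease timeout mentioned above


def should_skip_run(last_lease_update, now, timeout=LEASE_TIMEOUT_SECONDS):
    """Return True if an indexing run should be skipped.

    last_lease_update: time (seconds) of the last lease update, or None if
    no lease has been written yet.
    """
    if last_lease_update is None:
        return False  # nobody holds a lease
    # A recently updated lease means another node is presumably still indexing.
    return (now - last_lease_update) < timeout
```

The sketch also shows why an unclean shutdown can delay indexing: the stale lease stays in place, so runs keep being skipped until the timeout has passed.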
#### <a name="async-index-lag"></a> Indexing Lag
-Async indexing jobs are by default configured to run at interval of 5 secs.
Depending on the system load and diff size
-of content to be indexed the indexing may start lagging by longer time
intervals. Due to this the indexing results would
-lag behind the repository state and may become stale i.e. new content added
would show up in result after some time.
+Async indexing jobs are by default configured to run at an interval of 5
seconds.
+Depending on the system load and diff size of content to be indexed,
+the indexing may start lagging by a longer time interval.
+Due to this, the indexing results can lag behind the repository state,
+and may become stale, that is, new content will show up in query results only after some time.
-`IndexStats` MBean keeps a time series and metrics stats for the indexing
frequency. This can be used to track the
-indexing state
+The `IndexStats` MBean keeps a time series and metrics stats for the indexing
frequency.
+This can be used to track the indexing state.
-[NRT Indexing](#nrt-indexing) introduced in Oak 1.6 would help in such
situations and can keep the results more upto
-date
+[NRT Indexing](#nrt-indexing) introduced in Oak 1.6 helps in such situations,
+and can keep the results more up to date.
#### <a name="async-index-setup"></a> Setup
`@since Oak 1.6`
-Async indexers can be configure via OSGi config for
`org.apache.jackrabbit.oak.plugins.index.AsyncIndexerService`
+Async indexers can be configured via the OSGi config for `org.apache.jackrabbit.oak.plugins.index.AsyncIndexerService`.

-Different lanes can be configured by adding more rows of _Async Indexer
Configs_. Prior to 1.6 the indexers were
-created programatically while constructing Oak.
+Different lanes can be configured by adding more rows of _Async Indexer
Configs_.
+Prior to 1.6, the indexers were created programmatically while constructing Oak.
#### <a name="async-index-mbean"></a> Async Indexing MBean
-For each configured async indexer in the setup the indexer exposes a
`IndexStatsMBean` which provides various
-stats around current indexing state.
+For each configured async indexer in the setup, the indexer exposes an `IndexStatsMBean`,
+which provides various stats around the current indexing state:
org.apache.jackrabbit.oak: async (IndexStats)
org.apache.jackrabbit.oak: fulltext-async (IndexStats)
It provide details like
-* FailingIndexStats - Stats around indexes which are [failing and marked as
corrupt](#corrupt-index-handling)
-* LastIndexedTime - Time upto which repository state has been indexed
-* Status - running, done, failing etc
-* Failing - boolean flag indicating that indexing has been failing due to some
issue. This can be monitored
- for detecting if indexer is healthy or not
-* ExecutionCount - Time series data around when number of execution for
various time intervals
+* FailingIndexStats - Stats around indexes which are [failing and marked as
corrupt](#corrupt-index-handling).
+* LastIndexedTime - Time up to which the repository state has been indexed.
+* Status - running, done, failing etc.
+* Failing - boolean flag indicating that indexing has been failing due to some
issue.
+ This can be monitored for detecting if indexer is healthy or not.
+* ExecutionCount - Time series data around the number of runs for various time
intervals.
Further it provides operations like
-* pause - Pauses the indexer
-* abortAndPause - Aborts any running indexing cycle and pauses the indexer.
Invoke 'resume' once you are ready
- to resume indexing again
-* resume - Resume the indexing
+* pause - Pauses the indexer.
+* abortAndPause - Aborts any running indexing cycle and pauses the indexer.
+ Invoke 'resume' once you are ready to resume indexing again.
+* resume - Resume indexing.
#### <a name="corrupt-index-handling"></a> Isolating Corrupt Indexes
`Since 1.6`
-AsyncIndexerService would now mark any index which fails to update for 30 mins
(configurable) as `corrupt` and
-ignore such indexes from further indexing.
+The `AsyncIndexerService` marks any index which fails to update for 30 mins
+(configurable) as `corrupt`, and excludes such indexes from further indexing.
-When any index is marked as corrupt following log entry would be made
+When any index is marked as corrupt, the following log entry is made:
- 2016-11-22 12:52:35,484 INFO NA [async-index-update-fulltext-async]
o.a.j.o.p.i.AsyncIndexUpdate - Marking
- [/oak:index/lucene] as corrupt. The index is failing since Tue Nov 22
12:51:25 IST 2016 ,1 indexing cycles, failed
- 7 times, skipped 0 time
+ 2016-11-22 12:52:35,484 INFO NA [async-index-update-fulltext-async]
o.a.j.o.p.i.AsyncIndexUpdate -
+ Marking [/oak:index/lucene] as corrupt. The index is failing since Tue Nov
22 12:51:25 IST 2016,
+ 1 indexing cycles, failed 7 times, skipped 0 time
-Post this when any new content gets indexed and any such corrupt index is
skipped then following warn entry would be made
+Post this, when any new content gets indexed and any such corrupt index is
skipped,
+the following warn entry is made:
- 2016-11-22 12:52:35,485 WARN NA [async-index-update-fulltext-async]
o.a.j.o.p.index.IndexUpdate - Ignoring corrupt
- index [/oak:index/lucene] which has been marked as corrupt since
[2016-11-22T12:51:25.492+05:30]. This index MUST be
- reindexed for indexing to work properly
+ 2016-11-22 12:52:35,485 WARN NA [async-index-update-fulltext-async]
o.a.j.o.p.index.IndexUpdate -
+ Ignoring corrupt index [/oak:index/lucene] which has been marked as
corrupt since
+ [2016-11-22T12:51:25.492+05:30]. This index MUST be reindexed for indexing
to work properly
-This info would also be seen in MBean
+This info is also seen in the MBean.

-Later once the index is reindexed following log entry would be made
+Later, once the index is reindexed, the following log entry is made:
- 2016-11-22 12:56:25,486 INFO NA [async-index-update-fulltext-async]
o.a.j.o.p.index.IndexUpdate - Removing corrupt
- flag from index [/oak:index/lucene] which has been marked as corrupt since
[corrupt = 2016-11-22T12:51:25.492+05:30]
+ 2016-11-22 12:56:25,486 INFO NA [async-index-update-fulltext-async]
o.a.j.o.p.index.IndexUpdate -
+ Removing corrupt flag from index [/oak:index/lucene] which has been marked
as corrupt since
+ [corrupt = 2016-11-22T12:51:25.492+05:30]
-This feature can be disabled by setting `failingIndexTimeoutSeconds` to 0 in
AsyncIndexService config. Refer to
-[OAK-4939][OAK-4939] for more details
+This feature can be disabled by setting `failingIndexTimeoutSeconds` to 0 in the `AsyncIndexerService` config.
+See also [OAK-4939][OAK-4939] for more details.
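
The timeout rule for marking an index as corrupt can be modelled as follows. This is only a sketch of the rule as described in this section, not the Oak implementation: `is_corrupt` is a hypothetical name, times are plain seconds, and the parameter mirrors the `failingIndexTimeoutSeconds` setting.

```python
DEFAULT_FAILING_INDEX_TIMEOUT_SECONDS = 30 * 60  # 30 mins, as stated above


def is_corrupt(failing_since, now,
               failing_index_timeout_seconds=DEFAULT_FAILING_INDEX_TIMEOUT_SECONDS):
    """Return True if an index should be treated as corrupt and skipped.

    failing_since: time (seconds) of the first failed update, or None if the
    index is currently updating fine.
    """
    if failing_index_timeout_seconds == 0:
        return False  # feature disabled
    if failing_since is None:
        return False  # index is healthy
    return (now - failing_since) >= failing_index_timeout_seconds
```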
### <a name="nrt-indexing"></a> Near Real Time Indexing
@@ -289,61 +325,66 @@ This feature can be disabled by setting
_This mode is only supported for `lucene` indexes_
-Lucene indexes perform well for evaluating complex queries and also have the
benefit of being evaluated locally with
-copy-on-read support. However they are `async` index and depending on system
load can lag behind the repository state.
-For cases where such lag (of order of minutes) is not acceptable one has to
use `property` indexes. For such cases
-Oak 1.6 has [added support for near real time indexing][OAK-4412]
+Lucene indexes perform well for evaluating complex queries,
+and also have the benefit of being evaluated locally with copy-on-read
support.
+However, they are `async`, and depending on system load can lag behind the
repository state.
+For cases where such lag (in the order of minutes) is not acceptable,
+one has to use `property` indexes.
+For such cases, Oak 1.6 has [added support for near real time indexing][OAK-4412].

-In this mode the indexing would happen in 2 modes and query would consult
multiple indexes. The diagram above shows
-indexing flow with time. In above flow
+In this mode, the indexing happens in two modes, and a query consults multiple indexes.
+The diagram above shows the indexing flow with time. In the above flow,
* T1, T3 and T5 - Time instances at which checkpoint is created
-* T2 and T4 - Time instance when async indexer run completed and indexes were
updated
+* T2 and T4 - Time instances at which an async indexer run completed and indexes were updated
* Persisted Index
- * v2 - Index version v2 which has repository state upto time T1 indexed
- * v3 - Index version v2 which has repository state upto time T3 indexed
+ * v2 - Index version v2, which has repository state up to time T1 indexed
+ * v3 - Index version v3, which has repository state up to time T3 indexed
* Local Index
- * NRT1 - Local index which repository state between time T2 and T4 indexed
- * NRT2 - Local index which repository state between time T4 and T6 indexed
+ * NRT1 - Local index, which has repository state between time T2 and T4
indexed
+ * NRT2 - Local index, which has repository state between time T4 and T6
indexed
-As repository state changes with time Async indexer would run and index state
between last known checkpoint and
-current state when that run started. So when asyn run 1 completed the
persisted index has repository state indexed
-upto time T3.
-
-Now without NRT index support if any query is performed between time T2 and T4
it would only see index result for
-repository state at time T1 as thats state which the persisted indexes have
data for. Any change after that would not be
-seen untill next async indexing cycle complete (by time T4).
-
-With NRT indexing support indexing would happen at 2 places
-
-* Persisted Index - This is the index which is updated via async indexer run.
This flow would remain same i.e. it
- would be periodically updated by the indexer run
-* Local Index - In addition to persisted index each cluster node would also
maintain a local index. This index would
- only keep data between 2 async indexer run. Post each run the previous index
would be discarded and a new index would
- be built (actually previous index is retained for one cycle)
+As the repository state changes with time, the Async indexer will run and
index the
+state between last known checkpoint and current state when that run started.
+So when async run 1 completed, the persisted index has the repository state indexed up to time T3.
+
+Now without NRT index support, if any query is performed between time T2 and
T4,
+it can only see index result for repository state at time T1,
+as that is the state for which the persisted indexes have data.
+Any change after that cannot be seen until the next async indexing cycle is complete (by time T4).
+
+With NRT indexing support, indexing happens in two places:
+
+* Persisted Index - This is the index which is updated via the async indexer
run.
+ This flow remains the same: it is periodically updated by the indexer run.
+* Local Index - In addition to persisted index, each cluster node will also
maintain a local index.
+ This index only keeps data between two async indexer runs.
+ Post each run, the previous index is discarded, and a new index is built
+ (actually the previous index is retained for one cycle).
-Any query making use of such an index would make use of both indexes. With
this new content added in repository
-after the last async index run would also show up quickly.
+Any query making use of such an index will automatically make use of both the
persisted and the local indexes.
+With this, new content added in the repository after the last async index run
will also show up quickly.
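
The way a query combines both indexes can be sketched as below. This is a conceptual model only, not how the Lucene-based implementation works internally: both indexes are assumed to be dictionaries from a search term to a collection of paths, and `query` is a hypothetical name.

```python
def query(persisted_index: dict, local_index: dict, term: str) -> list:
    """Return all paths matching term from the persisted and the local index."""
    persisted_hits = set(persisted_index.get(term, ()))
    local_hits = set(local_index.get(term, ()))
    # The union removes duplicates for content present in both indexes.
    return sorted(persisted_hits | local_hits)


# State indexed by the last async run.
persisted = {"oak": ["/content/a"]}
# Changes committed on this cluster node since that run.
local = {"oak": ["/content/b"], "nrt": ["/content/b"]}
```

Here `query(persisted, local, "oak")` returns both `/content/a` and `/content/b`, although `/content/b` has not yet been picked up by an async indexer run.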
#### <a name="nrt-indexing-modes"></a> NRT Indexing Modes
-NRT indexing can be enabled for any index by configuring the `async` property
+NRT (Near real time) indexing can be enabled for any index by configuring the
`async` property:
/oak:index/assetIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = ['fulltext-async', 'nrt']
-Here `async` value has been set to a multi value property where
+Here, the `async` property is multi-valued, and consists of:
-* Indexing lane - Like `async` or `fulltext-async`
-* NRT Indexing Mode - `nrt` or `sync`
+* Indexing lane - For example `async` or `fulltext-async`,
+* NRT Indexing Mode - `nrt` or `sync`.
##### <a name="nrt-indexing-mode-nrt"></a> nrt
-In this mode the local index would be updated asynchronously on that cluster
nodes post commit and the index reader
-would be refreshed after 1 sec. So any change done should should show up on
that cluster node in 1-2 secs
+In this mode, the local index is updated asynchronously on that cluster node after each commit,
+and the index reader is refreshed each second.
+So any change should show up on that cluster node within 1 to 2 seconds.
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
@@ -351,73 +392,81 @@ would be refreshed after 1 sec. So any c
##### <a name="nrt-indexing-mode-sync"></a> sync
-In this mode the local index would be updated synchronously on that cluster
nodes post commit and the index reader
-would be refreshed immediately. This mode performs slowly compared to the
"nrt" mode
+In this mode, the local index is updated synchronously on that cluster node after each commit,
+and the index reader is refreshed immediately.
+This mode performs more slowly compared to the "nrt" mode.
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = ['async', 'sync']
-For a single node setup (like with SegmentNodeStore) this mode effectively
makes async lucene index perform same as
-synchronous property indexes. However 'nrt' mode performs better so using that
would be preferable
-
+For a single node setup (for example with the `SegmentNodeStore`),
+this mode effectively makes an async lucene index perform the same as synchronous property indexes.
+However, the 'nrt' mode performs better, so using that is preferable.
+
#### <a name="nrt-indexing-cluster-setup"></a> Cluster Setup
-In cluster setup each cluster node would maintain its own local index for
changes happening in that cluster node.
-In addition to that it would also index changes from other cluster node by
relying on [Oak observation for external
-changes][OAK-4808]. This depends on how frequently external changes are
delivered. Due to this even with NRT indexing
-changes from other cluster node would take some more time to reflect in query
result compared to local changes.
+In a cluster setup, each cluster node maintains its own local index for changes happening in that cluster node.
+In addition to that, it also indexes changes from other cluster nodes by relying on
+[Oak observation for external changes][OAK-4808].
+This depends on how frequently external changes are delivered.
+Due to this, even with NRT indexing changes from other cluster nodes will take
some more time
+to be reflected in query results compared to local changes.
#### <a name="nrt-indexing-config"></a> Configuration
-NRT indexing expose few configuration options as part of
[LuceneIndexProviderService](lucene.html#osgi-config)
+NRT indexing exposes a few configuration options as part of the [LuceneIndexProviderService](lucene.html#osgi-config):
+
+* `enableHybridIndexing` - Boolean property, defaults to `true`.
+ Can be set to `false` to disable the NRT indexing feature completely.
+* `hybridQueueSize` - The size of the in-memory queue used
+ to hold Lucene documents for indexing in the `nrt` mode.
+ The default size is 10000.
-* `enableHybridIndexing` - Boolean property defaults to `true`. Can be set to
`false` to disable NRT indexing feature
- completely
-* `hybridQueueSize` - Size of in memory queue used to hold Lucene documents
for indexing in `nrt` mode. Default size is
- 10000
-
## <a name="reindexing"></a> Reindexing
-Reindexing of existing indexes is required in following scenarios
+Reindexing of existing indexes is required in the following scenarios:
-* Incompatible change in index definition - For example adding properties to
the index which is already
- present in repository
-* Corrupted Index - If the index is corrupt and `AsyncIndexUpdate` run fails
with exception pointing to index being
- corrupt
+* Incompatible changes in the index definition -
+ For example, adding properties to the index that are already
+ present in the repository.
+* Corrupted Index - If the index is corrupt and the `AsyncIndexUpdate` run fails
+ with an exception pointing to the index being corrupt.
-Reindexing does not resolve other problems, such that queries not returning
data. For such cases, it is _not_
-recommended to reindex (also because this can be very slow and use a lot of
temporary disk space).
+Reindexing does not resolve other problems, such as queries not returning data.
+For such cases, it is _not_ recommended to reindex (also because this can be very slow and use a lot of temporary disk space).
If queries don't return the right data, then possibly the index is [not yet
up-to-date][OAK-5159],
-or the query is incorrect, or included/excluded path settings are wrong (for
Lucene indexes). Instead of reindexing, it
-is suggested to first check the log file, modify the query so it uses a
different index or traversal and run the query again.
+or the query is incorrect, or included/excluded path settings are wrong (for
Lucene indexes).
+Instead of reindexing, it is suggested to first check the log file,
+modify the query so it uses a different index or traversal and run the query
again.
One case where reindexing can help is if the query engine picks a very slow
index for some queries because the counter index
[got out of sync after adding and removing lots of nodes many times (fixed in
recent versions)][OAK-4065].
For this case, it is recommended to verify the contents of the counter index
first,
and upgrade Oak before reindexing.
-Also note that with Oak 1.6 for Lucene indexes changes in index definition are
only effective
-[post reindexing](lucene.html#stored-index-definition)
+Also note that with Oak 1.6, for Lucene indexes, changes in the index
definition are only effective
+[post reindexing](lucene.html#stored-index-definition).
-To reindex any index set the `reindex` flag to `true` in index definition
+To reindex any index, set the `reindex` flag to `true` in the index definition:
/oak:index/userIndex
- jcr:primaryType = "oak:QueryIndexDefinition"
- async = ['async']
- reindex = true
-Once changes are saved the index would be reindexed. For synchronous indexes
the reindexing would be done
-as part of save (or commit) itself. While for asynchronous indexes are
reindexed whenever the next async
-indexing cycle run happens. Once reindexing starts following log entries can
be seen in the log
+Once changes are saved, the index is reindexed. For synchronous indexes,
+the reindexing is done as part of the save (or commit) itself.
+For asynchronous indexes, reindexing starts with the next async indexing cycle.
+Once reindexing starts, the following log entries can be seen in the log:
[async-index-update-async] o.a.j.o.p.i.IndexUpdate Reindexing will be
performed for following indexes: [/oak:index/userIndex]
[async-index-update-async] o.a.j.o.p.i.IndexUpdate Reindexing Traversed
#100000 /home/user/admin
[async-index-update-async] o.a.j.o.p.i.AsyncIndexUpdate [async] Reindexing
completed for indexes: [/oak:index/userIndex*(4407016)] in 30 min
-
-In both cases once reindexing is complete the `reindex` flag would be removed.
-For property index you can also make use of `PropertyIndexAsyncReindexMBean`.
Refer to
-[reindeinxing property indexes](property-index.html#reindexing) section for
more details on that
+In both cases, once reindexing is complete, the `reindex` flag is removed.
+
+For a property index, you can also make use of the `PropertyIndexAsyncReindexMBean`.
+See also the [reindexing property indexes](property-index.html#reindexing) section for more details on that.
[OAK-5159]: https://issues.apache.org/jira/browse/OAK-5159