Hi Team,

As the use of async indexes like Lucene grows, we need to account for
the delay in showing updated results caused by the async nature of
indexing. Depending on system load, the async indexer might lag behind
the latest state by some margin. We have improved performance quite a
bit, but by design there will be a lag, and under load that lag will
increase at times.

For example, a typical flow in content authoring involves the user
uploading an asset to the application. After uploading the asset, the
user goes to the authoring view and looks for it via a content finder
kind of UI. That UI relies on a query to show the available assets.
Due to the delay introduced by the async indexer, it can take some
time (10-15 sec) for the uploaded asset to show up there.

To account for that we can go for near real time (NRT*) in-memory
indexing, which would complement the actual persisted async indexer
and exploit the fact that requests from the same user in a given
session would most likely hit the same cluster node.

Below is a brief proposal. This would require changes in the layer
above in Oak, but for now the focus is on feasibility.

Proposal
=======

A - Indexing Side
------------------

The Lucene index can be configured to support NRT mode. If this mode
is enabled, then on each cluster node we would perform AsyncIndex only
for local changes. For such an indexer, LuceneIndexEditor would use a
RAMDirectory. This directory would only hold *recently* modified/added
documents.

B - Query Side
---------------

On the query side, LucenePropertyIndex would perform the search
against two IndexSearchers:

1. An IndexSearcher based on the persisted OakDirectory
2. An IndexSearcher obtained from the currently active IndexWriter
used with the RAMDirectory

The query would be performed against both, and a merged cursor [2]
would be returned.
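
To make the merged cursor idea a bit more concrete, here is a minimal
sketch in plain Java. It is not Oak or Lucene code: MergedCursor is a
hypothetical name, and each searcher's hits are assumed to be already
reduced to non-null node paths. It returns hits from the in-memory
source first and skips any path already seen, so a document present in
both indexes is reported only once.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical sketch, not an Oak API: merges paths from two result sources. */
final class MergedCursor implements Iterator<String> {
    private final Iterator<String> recent;    // hits from the in-memory (NRT) searcher
    private final Iterator<String> persisted; // hits from the persisted searcher
    private final Set<String> seen = new HashSet<>();
    private String next;

    MergedCursor(Iterator<String> recent, Iterator<String> persisted) {
        this.recent = recent;
        this.persisted = persisted;
        advance();
    }

    /** Moves to the next path that has not been returned yet, if any. */
    private void advance() {
        next = null;
        while (next == null && (recent.hasNext() || persisted.hasNext())) {
            // Drain the in-memory hits first, then fall back to persisted hits.
            String candidate = recent.hasNext() ? recent.next() : persisted.next();
            if (seen.add(candidate)) {
                next = candidate;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }
}
```

Draining the in-memory hits first is only one possible choice. A real
implementation would also have to deal with scoring and ordering
across the two searchers, which this sketch ignores.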

C - Benefits
----------------

This approach would allow users to at least see their own
modifications appear quickly in search results, and would make the
accuracy of search results more deterministic.

This feature need not be enabled globally. It can be enabled on a
per-index basis, based on business requirements.

D - Challenges
-------------------
1. Ensuring that the RAMDirectory is bounded and only contains
recently modified documents. The lower limit can be based on the last
indexed time from AsyncIndexer. Periodically we would need to prune
old documents from this RAMDirectory.

2. IndexUpdate would need to be adapted to support this hybrid model
for the same index type, so this is something to be looked into.
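
The bounding and pruning in challenge 1 could be sketched roughly as
below. Again this is plain Java rather than Oak or Lucene code:
RecentDocs, maxEntries and lastIndexedMillis are hypothetical names,
and a map of path to modification time stands in for the documents
held in the RAMDirectory. It assumes the async indexer can report the
time up to which changes have been persisted.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not an Oak API: bounded set of recently modified paths. */
final class RecentDocs {
    private final int maxEntries;
    // path -> modification time in millis, kept in insertion order
    private final LinkedHashMap<String, Long> modifiedAt = new LinkedHashMap<>();

    RecentDocs(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Records a local change and enforces the hard size cap. */
    void add(String path, long modifiedMillis) {
        modifiedAt.remove(path); // re-insert so an updated path moves to the end
        modifiedAt.put(path, modifiedMillis);
        Iterator<String> it = modifiedAt.keySet().iterator();
        while (modifiedAt.size() > maxEntries && it.hasNext()) {
            it.next();
            it.remove(); // evict the entry that was inserted earliest
        }
    }

    /** Drops entries the persisted index already covers; returns how many were removed. */
    int prune(long lastIndexedMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = modifiedAt.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= lastIndexedMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean contains(String path) {
        return modifiedAt.containsKey(path);
    }

    int size() {
        return modifiedAt.size();
    }
}
```

The hard cap is a safety net in case the async indexer stalls. In the
normal case prune() would be called after each async index cycle with
the last indexed time.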

Thoughts?

Chetan Mehrotra

* NRT - Near Real Time is technically a Lucene term, see
https://wiki.apache.org/lucene-java/NearRealtimeSearch
However, I am using it here as the approach is a bit similar.

[2] Such a merged cursor, and performing queries against multiple
searchers, would anyway be required to support zero-downtime kind of
requirements, where index content would be split across local and
global instances.
