Hi Team,

As the use of async indexes like Lucene grows, we need to account for the delay in showing updated results that comes from the async nature of indexing. Depending on system load, the async indexer might lag behind the latest state by some margin. We have improved performance quite a bit, but by design there will be a lag, and under load that lag will increase at times.

For example, a typical content authoring flow involves a user uploading an asset to the application. After uploading, the user goes to the authoring view and looks for that asset via a content-finder kind of UI. That UI relies on a query to show the available assets. Due to the delay introduced by the async indexer, it would take some time (10-15 sec) before the uploaded asset shows up there.

To account for that we can go for near real time (NRT*) in-memory indexing, which would complement the actual persisted async indexer and would exploit the fact that requests from the same user in a given session would most likely hit the same cluster node.

Below is a brief proposal. This would require changes in the layer above in Oak, but for now the focus is on feasibility.

Proposal
========

A - Indexing Side
-----------------

The Lucene index can be configured to support NRT mode. If this mode is enabled, then on each cluster node we would perform AsyncIndex only for local changes. For such an indexer, LuceneIndexEditor would use a RAMDirectory. This directory would only hold *recently* modified/added documents.

B - Query Side
--------------

On the query side, LucenePropertyIndex would search against two IndexSearchers:

1. An IndexSearcher based on the persisted OakDirectory
2. An IndexSearcher obtained from the currently active IndexWriter used with the RAMDirectory

The query would be performed against both, and a merged cursor [2] would be returned.

C - Benefits
------------

This approach would allow users to at least see their own modifications appear quickly in search results, and would make the accuracy of search results more deterministic.

This feature need not be enabled globally; it can be enabled on a per-index basis, based on business requirements.

D - Challenges
--------------

1. Ensuring that the RAMDirectory is bounded and only contains recently modified documents. The lower limit can be based on the last indexed time from AsyncIndexer. Periodically we would need to prune old documents from this RAMDirectory.
2. IndexUpdate would need to be adapted to support this hybrid model for the same index type - so something to be looked into.

Thoughts?

Chetan Mehrotra

* NRT - Near Real Time is technically a Lucene term, see https://wiki.apache.org/lucene-java/NearRealtimeSearch. However, I am using it here as the approach is a bit similar!

[2] Such a merged cursor, and performing a query against multiple searchers, would anyway be required to support a zero-downtime kind of requirement, where index content would be split across local and global instances.
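
To make the hybrid model above a bit more concrete, here is a minimal, stdlib-only Java sketch of the idea. It is not Oak or Lucene code: `HybridIndexSketch`, `onLocalChange`, `asyncCycle` and `query` are hypothetical names, a plain set stands in for the persisted OakDirectory-backed index, and a map of path to modification time stands in for the RAMDirectory. It only illustrates two points from the proposal: querying both parts with a merged, de-duplicated result, and pruning the in-memory part using the last indexed time of the async indexer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the proposed hybrid index; not Oak or Lucene API. */
public class HybridIndexSketch {
    // Paths already covered by the persisted (async) index.
    private final Set<String> persisted = new LinkedHashSet<>();
    // Recently modified paths -> modification time; models the bounded in-memory part.
    private final Map<String, Long> recent = new LinkedHashMap<>();

    /** A local change becomes visible immediately through the in-memory part. */
    public void onLocalChange(String path, long modifiedAt) {
        recent.put(path, modifiedAt);
    }

    /**
     * Simulates an async indexing cycle that has caught up to lastIndexedTime.
     * Entries at or before that time are assumed to be in the persisted index
     * now, so they are pruned from the in-memory part to keep it bounded.
     */
    public void asyncCycle(long lastIndexedTime) {
        Iterator<Map.Entry<String, Long>> it = recent.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= lastIndexedTime) {
                persisted.add(e.getKey());
                it.remove();
            }
        }
    }

    /** Queries both parts and merges the results, dropping duplicate paths. */
    public List<String> query(String prefix) {
        Set<String> merged = new LinkedHashSet<>();
        for (String p : persisted) {
            if (p.startsWith(prefix)) merged.add(p);
        }
        for (String p : recent.keySet()) {
            if (p.startsWith(prefix)) merged.add(p);
        }
        return new ArrayList<>(merged);
    }

    /** Number of entries still held in the in-memory part. */
    public int recentSize() {
        return recent.size();
    }

    public static void main(String[] args) {
        HybridIndexSketch idx = new HybridIndexSketch();
        idx.onLocalChange("/content/dam/a.png", 100L);
        // The asset is found before any async cycle has run.
        System.out.println(idx.query("/content/dam"));
        idx.asyncCycle(150L);
        idx.onLocalChange("/content/dam/b.png", 200L);
        System.out.println(idx.query("/content/dam") + " recent=" + idx.recentSize());
    }
}
```

In a real implementation the merge would presumably happen at the cursor level and would also have to handle a document that exists in both parts with different content; the sketch sidesteps that by de-duplicating on path only.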