[
https://issues.apache.org/jira/browse/OAK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davide Giannella updated OAK-6513:
----------------------------------
Fix Version/s: 1.9.11
> Journal based Async Indexer
> ---------------------------
>
> Key: OAK-6513
> URL: https://issues.apache.org/jira/browse/OAK-6513
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: indexing
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Major
> Fix For: 1.10, 1.9.10, 1.9.11
>
>
> The current async indexer design is based on NodeState diff. This has served
> us well so far, but of late it has not performed well when the rate of
> repository writes is high. When changes happen faster than the index update
> can process them, the diffs grow larger and larger. Larger diffs make index
> updates slower, which in turn makes the next diff even larger than the one
> before (assuming a constant ingestion rate).
> In the current diff-based flow the indexer performs a complete diff of all
> changes that happened between 2 cycles. There may be many writes but not
> much indexable content written, in which case the diff is wasted effort.
> In the 1.6 release we implemented journal-based indexing of external changes
> for NRT Indexing (OAK-4808, OAK-5430). That approach can be generalized and
> used for async indexing.
> Before discussing the journal-based approach, let's see how an IndexEditor
> works currently.
> h4. IndexEditor
> Currently any IndexEditor performs 2 tasks:
> # Identify which node is to be indexed based on some index definition. The
> Editor gets invoked as part of the content diff, where it determines which
> NodeState is to be indexed
> # Update the index based on the node to be indexed
> For example, in oak-lucene we have LuceneIndexEditor, which identifies the
> NodeStates to be indexed, and LuceneDocumentMaker, which constructs the Lucene
> Document from the NodeState to be indexed. For the journal-based approach we
> can decouple these 2 parts and thus have
> * IndexEditor - Identifies which paths need to be indexed for a given index
> definition
> * IndexUpdater - Updates the index based on a given NodeState and its path
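
A minimal sketch of the split described above, under simplifying assumptions: a node is represented as a plain property map rather than an Oak NodeState, and the names PathSelector, PathIndexUpdater, PropertyPathSelector and InMemoryIndexUpdater are hypothetical. None of them are Oak interfaces; PathSelector only plays the role the proposal gives to the slimmed-down IndexEditor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified types for illustration; these are not Oak APIs.

/** Part 1: decides which changed paths matter for one index definition. */
interface PathSelector {
    List<String> pathsToIndex(Map<String, Map<String, String>> changedNodes);
}

/** Part 2: updates the index for one node, given its path and properties. */
interface PathIndexUpdater {
    void update(String path, Map<String, String> nodeProperties);
}

/** Selects every changed node that carries the indexed property. */
final class PropertyPathSelector implements PathSelector {
    private final String indexedProperty;

    PropertyPathSelector(String indexedProperty) {
        this.indexedProperty = indexedProperty;
    }

    @Override
    public List<String> pathsToIndex(Map<String, Map<String, String>> changedNodes) {
        List<String> paths = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : changedNodes.entrySet()) {
            if (entry.getValue().containsKey(indexedProperty)) {
                paths.add(entry.getKey());
            }
        }
        return paths;
    }
}

/** Toy index that maps a path to the value of the indexed property. */
final class InMemoryIndexUpdater implements PathIndexUpdater {
    final Map<String, String> index = new TreeMap<>();
    private final String indexedProperty;

    InMemoryIndexUpdater(String indexedProperty) {
        this.indexedProperty = indexedProperty;
    }

    @Override
    public void update(String path, Map<String, String> nodeProperties) {
        String value = nodeProperties.get(indexedProperty);
        if (value != null) {
            index.put(path, value);
        }
    }
}
```

The point of the separation is that the selector never touches the index and the updater never walks a diff, so the two can run at different times: the first at commit time, the second in the async cycle.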
> h4. High Level Flow
> # Session Commit Flow
> ## Each index type would provide an IndexEditor which would be invoked as part
> of the commit (like sync indexes). These IndexEditors would just determine
> which paths need to be indexed.
> ## As part of the commit the paths to be indexed would be written to the
> journal.
> # AsyncIndexUpdate flow
> ## AsyncIndexUpdate would query this journal to fetch all paths marked for
> indexing between the 2 checkpoints
> ## Based on the index path data it would invoke the {{IndexUpdater}} to
> update the index for that path
> ## Merge the index updates
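
The two flows above can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: PathJournal and JournalAsyncIndexer are hypothetical names, checkpoints are modelled as plain commit numbers, the repository is a path-to-content map, and deleted paths are simply skipped because handling deletes is still an open point.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; class and method names are illustrative, not Oak APIs.

/** Append-only journal of paths to index, keyed by an increasing commit number. */
final class PathJournal {
    private final TreeMap<Long, List<String>> entries = new TreeMap<>();
    private long lastCommit = 0;

    /** Commit flow: record the paths the editors selected for this commit. */
    synchronized long append(List<String> pathsToIndex) {
        lastCommit++;
        entries.put(lastCommit, new ArrayList<>(pathsToIndex));
        return lastCommit;
    }

    /** Async flow: distinct paths recorded after 'from' (exclusive) up to 'to' (inclusive). */
    synchronized Set<String> pathsBetween(long from, long to) {
        Set<String> paths = new LinkedHashSet<>();
        for (List<String> batch : entries.subMap(from, false, to, true).values()) {
            paths.addAll(batch);
        }
        return paths;
    }

    synchronized long head() {
        return lastCommit;
    }
}

/** One async indexing cycle driven by the journal instead of a full diff. */
final class JournalAsyncIndexer {
    private final PathJournal journal;
    private final Map<String, String> repository; // path -> content, stands in for the node store
    private final Map<String, String> index = new TreeMap<>();
    private long lastIndexedCheckpoint = 0;

    JournalAsyncIndexer(PathJournal journal, Map<String, String> repository) {
        this.journal = journal;
        this.repository = repository;
    }

    /** Reads journal entries since the last checkpoint, updates the index, returns the paths seen. */
    Set<String> runCycle() {
        long checkpoint = journal.head();
        Set<String> paths = journal.pathsBetween(lastIndexedCheckpoint, checkpoint);
        Map<String, String> pending = new TreeMap<>();
        for (String path : paths) {
            String content = repository.get(path);
            if (content != null) {
                pending.put(path, content); // the "IndexUpdater" step for this path
            }
            // Paths that no longer exist are ignored here; deletes are an open point.
        }
        index.putAll(pending); // merge the index updates in one step
        lastIndexedCheckpoint = checkpoint;
        return paths;
    }

    Map<String, String> index() {
        return index;
    }
}
```

In this sketch the cost of a cycle depends on the number of journalled paths, not on the total volume of writes between the two checkpoints, which is the property the proposal is after.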
> h4. Benefits
> Such a design would have the following impact
> # More work is done as part of the write
> # Marking of indexable content is distributed, hence less work needs to be
> done at indexing time
> # Indexing can progress in batches
> # The indexers can be called in parallel
> h4. Journal Implementation
> DocumentNodeStore currently has a built-in journal which is used for
> NRT Indexing. That feature can be exposed as an API.
> For scaling indexing this design is mostly required for the cluster case. So
> we can possibly keep both indexing approaches (diff-based and journal-based)
> and use the journal-based one for DocumentNodeStore setups. Or we can look
> into implementing such a journal for SegmentNodeStore setups as well.
> h4. Open Points
> * Journal support in SegmentNodeStore
> * Handling deletes.
> Detailed proposal -
> https://wiki.apache.org/jackrabbit/Journal%20based%20Async%20Indexer