[ 
https://issues.apache.org/jira/browse/OAK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davide Giannella updated OAK-6513:
----------------------------------
    Fix Version/s:     (was: 1.12.0)

> Journal based Async Indexer
> ---------------------------
>
>                 Key: OAK-6513
>                 URL: https://issues.apache.org/jira/browse/OAK-6513
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Major
>             Fix For: 1.14.0
>
>
> The current async indexer design is based on NodeState diffs. This has 
> served us well so far, but lately it has not performed well when the rate of 
> repository writes is high. When changes arrive faster than the index update 
> can process them, the diffs grow larger and larger. Larger diffs make index 
> updates slower, which in turn makes the next diff even larger than the one 
> before (assuming a constant ingestion rate). 
> In the current diff-based flow the indexer performs a complete diff of all 
> changes that happened between two cycles. Many writes may happen while little 
> indexable content is written, in which case the diff is wasted effort.
> In the 1.6 release we implemented journal-based indexing of external changes 
> for NRT indexing (OAK-4808, OAK-5430). That approach can be generalized and 
> used for async indexing. 
> Before discussing the journal-based approach, let's look at how IndexEditor 
> currently works.
> h4. IndexEditor 
> Currently every IndexEditor performs two tasks:
> # Identify which nodes are to be indexed, based on some index definition. The 
> editor is invoked as part of the content diff, where it determines which 
> NodeStates are to be indexed.
> # Update the index based on the nodes to be indexed.
> For example, in oak-lucene we have LuceneIndexEditor, which identifies the 
> NodeStates to be indexed, and LuceneDocumentMaker, which constructs the 
> Lucene Document from each NodeState to be indexed. For the journal-based 
> approach we can decouple these two parts and thus have:
> * IndexEditor - identifies which paths need to be indexed for a given index 
> definition
> * IndexUpdater - updates the index based on a given NodeState and its path
> h4. High Level Flow
> # Session Commit Flow
> ## Each index type would provide an IndexEditor which would be invoked as 
> part of the commit (like sync indexes). These IndexEditors would only 
> determine which paths need to be indexed. 
> ## As part of the commit, the paths to be indexed would be written to the 
> journal. 
> # AsyncIndexUpdate flow
> ## AsyncIndexUpdate would query this journal to fetch all paths marked for 
> indexing between the two checkpoints
> ## Based on that path data it would invoke the {{IndexUpdater}} to update the 
> index for each path
> ## Merge the index updates
> h4. Benefits
> Such a design would have the following impact:
> # More work is done as part of each write
> # Marking of indexable content is distributed, hence less work remains to be 
> done at indexing time
> # Indexing can progress in batches 
> # The indexers can be called in parallel
> h4. Journal Implementation
> DocumentNodeStore currently has a built-in journal, which is used for NRT 
> indexing. That feature can be exposed as an API. 
> This design is mostly needed to scale indexing in the cluster case. So we 
> could keep both indexing modes implemented and use the journal-based one for 
> DocumentNodeStore setups, or we could look into implementing such a journal 
> for SegmentNodeStore setups as well.
> h4. Open Points
> * Journal support in SegmentNodeStore
> * Handling deletes. 
> Detailed proposal - 
> https://wiki.apache.org/jackrabbit/Journal%20based%20Async%20Indexer
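
The two flows described under "High Level Flow" can be illustrated with a minimal, self-contained Java sketch. This is not Oak code: every name in it (`JournalIndexingSketch`, `PathSelector`, `PathIndexUpdater`, `Journal`) is a hypothetical stand-in for the proposed IndexEditor / IndexUpdater split. It assumes an in-memory list in place of the DocumentNodeStore journal, models checkpoints as plain journal positions, and does not handle deletes, which the description lists as an open point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class JournalIndexingSketch {

    /** Commit-time half (hypothetical): decides whether a changed path is indexable. */
    interface PathSelector {
        boolean shouldIndex(String path, Map<String, String> properties);
    }

    /** Async half (hypothetical): updates the index from a node's state and its path. */
    interface PathIndexUpdater {
        void update(String path, Map<String, String> properties);
    }

    /** In-memory stand-in for a journal: one entry per commit, listing paths to index. */
    static final class Journal {
        private final List<Set<String>> entries = new ArrayList<>();

        void append(Set<String> paths) { entries.add(paths); }

        /** Number of commits recorded so far; used as a checkpoint position. */
        int head() { return entries.size(); }

        /** Deduplicated paths from the entries at positions [from, to). */
        Set<String> pathsBetween(int from, int to) {
            Set<String> paths = new TreeSet<>();
            for (Set<String> entry : entries.subList(from, to)) {
                paths.addAll(entry);
            }
            return paths;
        }
    }

    private final Map<String, Map<String, String>> repository = new TreeMap<>();
    private final Journal journal = new Journal();
    private final PathSelector selector;
    private int lastIndexed = 0; // position of the previous checkpoint

    JournalIndexingSketch(PathSelector selector) { this.selector = selector; }

    /** Session commit flow: apply the changes and journal only the indexable paths. */
    void commit(Map<String, Map<String, String>> changes) {
        Set<String> toIndex = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> change : changes.entrySet()) {
            repository.put(change.getKey(), change.getValue());
            if (selector.shouldIndex(change.getKey(), change.getValue())) {
                toIndex.add(change.getKey());
            }
        }
        journal.append(toIndex);
    }

    /** Async flow: read journaled paths since the last checkpoint and update each one. */
    int runAsyncCycle(PathIndexUpdater updater) {
        int head = journal.head();
        Set<String> paths = journal.pathsBetween(lastIndexed, head);
        for (String path : paths) {
            updater.update(path, repository.get(path));
        }
        lastIndexed = head;
        return paths.size();
    }

    public static void main(String[] args) {
        JournalIndexingSketch repo = new JournalIndexingSketch(
                (path, props) -> "asset".equals(props.get("type")));
        repo.commit(Map.of(
                "/content/a", Map.of("type", "asset", "title", "A"),
                "/tmp/x", Map.of("type", "temp")));
        repo.commit(Map.of(
                "/content/a", Map.of("type", "asset", "title", "A2"),
                "/content/b", Map.of("type", "asset", "title", "B")));

        Map<String, String> index = new TreeMap<>();
        int updated = repo.runAsyncCycle(
                (path, props) -> index.put(path, props.get("title")));
        System.out.println(updated + " paths updated: " + index);
    }
}
```

In this sketch the write to `/tmp/x` never reaches the updater, and `/content/a`, although written in two commits, is updated once from its latest state. The async cycle therefore does work proportional to the number of distinct indexable paths rather than to the size of a full diff.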



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
