[ 
https://issues.apache.org/jira/browse/OAK-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Marth updated OAK-1970:
-------------------------------
    Labels: performance resilience  (was: )

> Optimize the diff logic for large number of children case 
> ----------------------------------------------------------
>
>                 Key: OAK-1970
>                 URL: https://issues.apache.org/jira/browse/OAK-1970
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: mongomk
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance, resilience
>             Fix For: 1.3.0
>
>
> DocumentNodeStore currently makes use of query to determine child nodes which 
> have changed after certain time. Query used is something like
> {noformat}
> db.nodes.find({ _id: { $gt: "3:/content/foo/01/", $lt: "3:/content/foo010" }, 
> _modified: { $gte: <start time> } }).sort({_id:1})
> {noformat}
> OAK-1966 tries to optimize the majority case where start times is recent and 
> in that case it makes use of _modified index. However if the start time is 
> quite old and a node has large number of children say 100k then it would 
> involve scan of all those 100k nodes as _modified index would not be of much 
> help. 
> Instead of querying like this we can have a special handling for cases where 
> large number of children are involved. It would involve following steps
> After analyzing the runtime queries in most case it is seen that even with 
> old modified time the number of change nodes is < 50
> # Mark parent nodes which have large number of children say > 50
> # On such nodes we would keep an array of \{modifiedtime, childName\} ## 
> Array would be bounded say keep last 50 updates. This can be done via splice 
> and push operators [1]
> ## Each entry in array would record modifiedtime and name of child node which 
> was modified. 
> ## Array would be sorted on modifiedtime
> # Each updated to any child belonging to such parent would also involve 
> update to above array
> # When we query for modified we check if the parent has such an array (if 
> parent is in cache) and if that array has time entries from the required 
> start time we directly make use of that and avoid the query
> This should reduce needs for such queries in majority of cases
> [1] http://docs.mongodb.org/manual/reference/operator/update-array/
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to