[ https://issues.apache.org/jira/browse/OAK-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Marth updated OAK-1970:
-------------------------------
    Labels: performance resilience  (was: )

> Optimize the diff logic for large number of children case
> ----------------------------------------------------------
>
>                 Key: OAK-1970
>                 URL: https://issues.apache.org/jira/browse/OAK-1970
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: mongomk
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance, resilience
>             Fix For: 1.3.0
>
>
> DocumentNodeStore currently uses a query to determine which child nodes
> have changed after a certain time. The query is something like
> {noformat}
> db.nodes.find({ _id: { $gt: "3:/content/foo/01/", $lt: "3:/content/foo010" },
> _modified: { $gte: <start time> } }).sort({_id:1})
> {noformat}
> OAK-1966 tries to optimize the majority case where the start time is recent;
> in that case the query makes use of the _modified index. However, if the start
> time is quite old and a node has a large number of children, say 100k, the
> query would scan all of those 100k nodes because the _modified index would
> not be of much help.
> Instead of querying like this, we can add special handling for cases where a
> large number of children are involved. Analysis of the runtime queries shows
> that in most cases, even with an old modified time, the number of changed
> nodes is < 50. The handling would involve the following steps:
> # Mark parent nodes which have a large number of children, say > 50
> # On such nodes, keep an array of \{modifiedtime, childName\}
> ## The array would be bounded, say to the last 50 updates. This can be done
> via the $slice and $push operators [1]
> ## Each entry in the array would record the modifiedtime and the name of the
> child node which was modified.
> ## The array would be sorted on modifiedtime
> # Each update to any child of such a parent would also involve an update to
> the above array
> # When we query for modified children, we check whether the parent has such
> an array (if the parent is in the cache). If the array has entries going back
> to the required start time, we use it directly and avoid the query
> This should reduce the need for such queries in the majority of cases.
> [1] http://docs.mongodb.org/manual/reference/operator/update-array/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
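
The bounded-array idea in the steps above can be sketched roughly as follows. This is only an illustrative in-memory model in Python, not Oak code: the class RecentChildChanges, the field name _recentChildren, the keys t and name, and the truncated flag are all hypothetical names chosen for this sketch. The update document uses MongoDB's $push with the $each, $sort and $slice modifiers, where a negative $slice keeps the last N elements. The lookup answers only when the array is known to cover the requested start time; otherwise it signals that the regular query is still needed.

```python
from bisect import insort
from typing import Dict, List, Optional, Tuple

MAX_ENTRIES = 50  # bound taken from the "last 50 updates" suggestion


def bounded_push_update(modified_time: int, child_name: str) -> Dict:
    """Build a MongoDB update document that appends one entry and keeps only
    the newest MAX_ENTRIES. The field and key names are hypothetical."""
    return {
        "$push": {
            "_recentChildren": {
                "$each": [{"t": modified_time, "name": child_name}],
                "$sort": {"t": 1},       # keep the array ordered by time
                "$slice": -MAX_ENTRIES,  # negative slice keeps the last N
            }
        }
    }


class RecentChildChanges:
    """Hypothetical in-memory stand-in for the array kept on a parent node."""

    def __init__(self, max_entries: int = MAX_ENTRIES) -> None:
        self.max_entries = max_entries
        self.entries: List[Tuple[int, str]] = []  # sorted by modified time
        self.truncated = False  # True once an old entry has been dropped

    def record_change(self, modified_time: int, child_name: str) -> None:
        # Insert in time order, then drop the oldest entries over the bound.
        insort(self.entries, (modified_time, child_name))
        overflow = len(self.entries) - self.max_entries
        if overflow > 0:
            del self.entries[:overflow]
            self.truncated = True

    def changed_since(self, start_time: int) -> Optional[List[str]]:
        """Return the names of children changed at or after start_time, or
        None when the array cannot prove it is complete for that range."""
        if self.truncated and (
            not self.entries or self.entries[0][0] >= start_time
        ):
            # Dropped entries may fall inside the requested range.
            return None
        return sorted({name for t, name in self.entries if t >= start_time})
```

A caller would try changed_since(start_time) first and fall back to the _modified query only when it returns None. The strict comparison against the oldest retained entry is a conservative choice made for this sketch: an entry dropped with the same timestamp as the oldest retained one would otherwise be missed by a $gte-style lookup.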