[ 
https://issues.apache.org/jira/browse/OAK-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra updated OAK-1966:
---------------------------------

    Attachment: out.csv

Attached is a csv file which has been extracted from mongod.log

{noformat}
$ mlogfilter mongod.log --from "now -3days" > mongo-3days.log
$ mloginfo mongo-3days.log --queries
     source: mongo-3days.log
      start: 2014 Jul 11 20:21:54
        end: 2014 Jul 14 10:31:03
date format: iso8601-local
     length: 5061
     binary: unknown
    version: >= 2.6 (iso8601 format)

QUERIES
                                                                                
          
namespace           pattern                       count    min (ms)    max (ms) 
   mean (ms)    95%-ile (ms)    sum (ms)

aem-author.nodes    {"_id": 1, "_modified": 1}      432         101      859581 
        6907          2784.7     2984000
aem-author.nodes    {"_id": 1}                      694         101         552 
         167           277.0      116198
aem-author.blobs    {"_id": 1}                        2         101         175 
         138           171.3         276
$ cat mongo-3days.log |grep -F 'query aem-author.nodes query: { $query: { _id: 
{ $gt:'| grep -F '_modified' | mfilter --json > mongo-3days-json.log
{noformat}

>From the json dump the data was extracted via script. The CSV contains 
>following cols
# Delta between _modified and time of query
# nscanned
# Index used
# Result count
# Timestamp
# Path whoese children were queried for

Key points to observe
# ~70% queries were made with duration less than 60 sec
# For some queries the delta was quite high. 
# For those high durations an index scan would still traverse over 700k entries 
and get only 10-20 results which satisfy both conditions. Though time taken is 
less
# Such scans in almost all cases are happening for index data and that too uuid 
type where we expect very flat hierarchies. So child count would be high

> Add Hint for selecting more performant index in MongoDocumentStore#query 
> -------------------------------------------------------------------------
>
>                 Key: OAK-1966
>                 URL: https://issues.apache.org/jira/browse/OAK-1966
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: mongomk
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.1, 1.0.3
>
>         Attachments: out.csv
>
>
> In MongoDocumentStore#query we make a call like
> bq. db.nodes.find({ _id: { $gt: "3:/content/foo/01/", $lt: 
> "3:/content/foo010" }, _modified: { $gte: 1405085300 } }).sort({_id:1})
> Further we have indexes
> * {_id : 1}
> * {_modified : -1}
> Depending on scenario one of the two index would perform better
> * If very few changes have happened in the time interval then {{_modified}} 
> would perform better
> * In other cases {{_id}} index would perform better
> Ideally this should be decided by Mongo Query planner but it seems that at 
> times it is not making the right choice. For example getting plan for the 
> query with same shape yields following results 
> {noformat}
> A - planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:19046849 
> nscannedObjects:19046849 scanAndOrder:1 keyUpdates:0 numYields:41359 
> locks(micros) r:103894585 nreturned:2 reslen:1173 61334ms
> B - 2014-07-10T15:59:19.994-0400 [conn1365] query aem-author.nodes query: 
> {$query: { _id:{ $gt: "3:/content/foo/01/", $lt: "3:/content/foo010" }, 
> _modified:{ $gte: 1404245090 }}, $orderby:{ _id: 1 }} planSummary: IXSCAN{ 
> _modified: -1 } ntoreturn:0 ntoskip:0 nscanned:15626664 
> nscannedObjects:15626664 scanAndOrder:1 keyUpdates:0 numYields:324016 
> locks(micros) r:453960384 nreturned:1 reslen:582 972125ms
> C - 2014-07-11T15:22:44.579-0400 [conn1387] query aem-author.nodes query: { 
> $query: { _id: { $gt: "4:/oak:index/uuid/:index/", $lt: 
> "4:/oak:index/uuid/:index0" }, _modified: { $gte: 1405106530 } }, $orderby: { 
> _id: 1 }, $hint: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:0 
> ntoskip:0 nscanned:701631 nscannedObjects:701631 keyUpdates:0 numYields:42 
> locks(micros) r:4471112 nreturned:17 reslen:6557 2540ms 
> {noformat}
> So Mongo used a BasicCursor, _id index, _modified index in different runs. 
> Now lets see whats the difference between time of query and _modified and 
> nscanned
> * B - 15626664, 8 days, _modified - Should have used _id index as duration is 
> too large 
> * C - 701631, 34 sec, _id - Might have used modified index as duration to 
> check for is less
> Mongo 2.6 uses heuristics to determine which plan to use. As mentioned in 
> [SERVER-13866|https://jira.mongodb.org/browse/SERVER-13866]
> bq. Some background: to choose a query plan to use for a given query when 
> multiple candidate plans exist, the query engine runs each candidate plan and 
> then picks the plan that produced the most results during a trial period on a 
> subset of the data to be scanned. The winning query plan is then cached, and 
> used for subsequent queries of the same shape until the cache entry is 
> invalidated (which happens under certain conditions, such as when the data 
> distribution in the collection changes sufficiently or when the chosen query 
> plan performs consistently much worse than it did during initial selection).
> So at times Mongo might make a right guess at times not! So we need to 
> determine ways such that right index is used by Mongo to execute a given query



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to