Hi,

I looked a bit into how MongoDB selects indexes (query plans) and think we 
could take some inspiration.

So, the way MongoDB does it, as far as I understand:
* the query gets parsed into an Abstract Syntax Tree (so that the parameters 
can be stripped out, leaving an abstract query "shape")
* the first time such a query shape is seen, the query is executed against 
*all* available indexes
* the fastest index is put into a cache, so that when the same query (same 
shape, regardless of parameters) comes in again, only that fastest index is 
used (looked up from the cache)
* after a number of modifications that index-selection cache is flushed and 
the process starts over at the beginning
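To make the idea concrete, here is a rough sketch of such a shape-keyed plan cache in Java. All class and method names are made up for illustration; this is not actual MongoDB (or Oak) code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.ToLongFunction;

// Hypothetical sketch of the plan cache described above.
class PlanCache {
    // query shape (query text with parameters stripped) -> winning index name
    private final Map<String, String> bestIndexByShape = new ConcurrentHashMap<>();
    private final AtomicLong modifications = new AtomicLong();
    private final long flushThreshold;

    PlanCache(long flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // First time a shape is seen: run the query against every candidate
    // index and cache the fastest one. Afterwards: cache hit, no racing.
    String selectIndex(String queryShape, List<String> candidateIndexes,
                       ToLongFunction<String> measureMillis) {
        return bestIndexByShape.computeIfAbsent(queryShape, shape ->
                candidateIndexes.stream()
                        .min(Comparator.comparingLong(measureMillis))
                        .orElseThrow(IllegalStateException::new));
    }

    // Called on writes; after enough of them, flush the cache so the
    // next query of each shape races all indexes again.
    void onModification() {
        if (modifications.incrementAndGet() >= flushThreshold) {
            modifications.set(0);
            bestIndexByShape.clear();
        }
    }
}
```

Note how the second lookup of the same shape never calls the measurement function again; that is exactly where the "noise on the first run" concern below comes from.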

What I dislike about this process is that the first execution puts a lot more 
load on the system (because all indexes execute the query). Moreover, that 
first execution could be disturbed by noise, so the selection could be wrong.

What I like, though (if we ignore the noise issue from above), is that the 
selected index is the one that has actually proven to be the fastest.

So, for Oak: maybe we could enhance the deterministic selection process we have 
right now. We could run queries in the background to determine whether the cost 
factors that the indexes claim are actually correct (and, if not, correct them 
in the query engine). Those background queries could be the ones most often 
executed by users on that repository for which multiple indexes are capable of 
answering the query.
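The correction step could be as simple as keeping a per-index correction factor that scales the claimed cost by how wrong past claims turned out to be. Again just a sketch with invented names, not any actual Oak API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: scale each index's claimed cost by how far off
// its past claims were in background measurements.
class CostCorrector {
    private final Map<String, Double> correction = new ConcurrentHashMap<>();

    // A background run observed: the index claimed `claimedCost`, but the
    // actual execution cost `measuredCost` (same unit).
    void record(String indexName, double claimedCost, double measuredCost) {
        double observed = measuredCost / claimedCost;
        // exponential moving average, so a single noisy run can't flip the ranking
        correction.merge(indexName, observed,
                (old, cur) -> 0.9 * old + 0.1 * cur);
    }

    // The query engine would use this instead of the raw claimed cost.
    double correctedCost(String indexName, double claimedCost) {
        return claimedCost * correction.getOrDefault(indexName, 1.0);
    }
}
```

Indexes with no background data yet keep their claimed cost unchanged, so this degrades gracefully to the current deterministic behaviour.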

Consider this scenario: you have the same nodes indexed both in a local 
property index (on the machine that also serves requests) and in a remote 
SolrCloud cluster. If we only reason about index size etc., we can never 
account for the fact that the local machine's index might be much slower than 
the external machines that are used exclusively for answering queries. We 
could account for it, though, if we actually ran those queries a number of 
times on both indexes.
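Running the queries "a number of times" also addresses my own noise concern from above: compare medians over several background runs rather than a single execution. A minimal sketch (names made up):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical helper: median latency over n background runs, so one
// hiccup on the busy local machine doesn't decide the winner.
class MedianProbe {
    static long medianMillis(LongSupplier runQueryOnce, int runs) {
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            samples.add(runQueryOnce.getAsLong());
        }
        Collections.sort(samples);
        return samples.get(samples.size() / 2); // upper median for even n
    }
}
```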

Cheers
Michael

