Re: [DISCUSS] - QueryIndex selection

Thomas Mueller Wed, 04 Jun 2014 00:38:19 -0700

Hi,

QueryIndex.getCost: this is actually quite well documented (see the
Javadocs). But the implementations might not fully follow the contract :-)
Still, I think it's better to deprecate it and use
AdvancedQueryIndex, as it has more features (especially important for
ordered indexes). Currently, both QueryIndex and AdvancedQueryIndex are
supported; in the future I hope we can switch all index implementations to
AdvancedQueryIndex. I didn't want to do that just before the 1.0 release
however.

> get rid for example of the FullTextQueryIndex interface

The FullTextQueryIndex interface allows choosing a full-text index for
full-text queries, if one is available, even if its cost is higher. The
problem is that only full-text indexes can return the correct data if full-text
constraints are used. If we want to get rid of the FullTextQueryIndex
interface, we need to address this problem in some other way.
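
To make the rule above concrete, here is a minimal sketch of "pick the
lowest cost, but prefer a full-text index when the query has full-text
constraints". It is not the Oak implementation: IndexCandidate and
IndexSelector are hypothetical stand-ins for illustration, and the cost
is reduced to a single number.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index; NOT the Oak QueryIndex interface.
class IndexCandidate {
    final String name;
    final double cost;
    final boolean fullText;

    IndexCandidate(String name, double cost, boolean fullText) {
        this.name = name;
        this.cost = cost;
        this.fullText = fullText;
    }
}

// Hypothetical selector illustrating the rule described in the text.
class IndexSelector {

    // Returns the cheapest candidate, or null if the list is empty.
    // If the query has full-text constraints and at least one full-text
    // index exists, only full-text indexes are considered, even if a
    // non-full-text index reports a lower cost.
    static IndexCandidate select(List<IndexCandidate> indexes,
                                 boolean hasFullTextConstraint) {
        List<IndexCandidate> candidates = indexes;
        if (hasFullTextConstraint) {
            List<IndexCandidate> fullTextOnly = new ArrayList<>();
            for (IndexCandidate c : indexes) {
                if (c.fullText) {
                    fullTextOnly.add(c);
                }
            }
            if (!fullTextOnly.isEmpty()) {
                candidates = fullTextOnly;
            }
        }
        IndexCandidate best = null;
        for (IndexCandidate c : candidates) {
            if (best == null || c.cost < best.cost) {
                best = c;
            }
        }
        return best;
    }
}
```

Any replacement for the FullTextQueryIndex interface would have to keep
the filtering step in some form, for example as a capability flag on the
index plan.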

> should we always select the fastest index ? Especially for full text
>ones this should be in some way configurable.

There are multiple aspects to this: (a) "let the user decide which index
to use", and (b) "synchronous versus async indexes":

(a) The user should be able to decide which index to use for a certain
query. There are some problems with that: The index the user has in mind
might not be available (in a certain environment, or with a later version
of Oak, because for example the index implementation was replaced, or when
not using Oak). Hardcoding the index to use (in the query) is problematic.
Relational databases: Oracle supports "hints" (search for "oracle database
hint"). SQLite supports something similar, but there it's actually an
assertion that an index is available, not a hint:
http://www.sqlite.org/lang_indexedby.html . My position is that we should
avoid such a mechanism, and instead improve the query engine, the index
implementations, and the documentation.

(b) Synchronous versus async indexes: some indexes are updated
asynchronously, and therefore will not include the very latest additions
to the content. (Recently removed nodes are not a problem, as the query
engine has to check anyway whether a node is available.) We could let the
user decide whether using an asynchronous index is OK or not.

For both (a) and (b), one problem is that the JCR spec doesn't allow for
extensions in the query syntax, so if the user used "select ...
option async_ok", the query would not work with Jackrabbit 2.x and other
JCR implementations. Maybe we should create a JCR commons utility method,
so that one could use:

    QueryManager qm = ...

    Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK);

The method JcrUtils.createQuery could then use "instanceof" to decide
whether it's OK to modify the query (for Oak) or not (non-Oak).
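
A rough sketch of the "instanceof" idea, limited to the statement
rewriting. Everything here is an assumption for illustration:
QueryManagerLike and OakQueryManagerLike are stand-ins for
javax.jcr.query.QueryManager and an Oak-specific implementation,
QueryOptions is a hypothetical helper, and the "option async_ok" suffix
is only the example syntax from above, not something that exists today.

```java
// Stand-in for javax.jcr.query.QueryManager (assumption, to keep the
// sketch self-contained).
interface QueryManagerLike { }

// Stand-in for an Oak-specific QueryManager implementation (assumption).
interface OakQueryManagerLike extends QueryManagerLike { }

// Hypothetical helper; a real utility would go on to call
// qm.createQuery(statement, language) with the returned statement.
class QueryOptions {

    static final String ASYNC_OK = "async_ok";

    // Appends the option only if the query manager is known to
    // understand it; otherwise returns the statement unchanged, so the
    // same application code still runs on other JCR implementations.
    static String applyOption(QueryManagerLike qm, String statement,
                              String option) {
        if (qm instanceof OakQueryManagerLike) {
            return statement + " option " + option;
        }
        return statement;
    }
}
```

With this, the query text only changes when running on Oak; on
Jackrabbit 2.x the option is silently dropped.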

> for full text queries for example, one may be interested in having a
>higher recall (more documents matching the query) which may eventually
>lead to a slightly slower query execution / higher cost evaluation

That would also need to be specified by the user in some way, right? For
example in the query itself? We could use a mechanism similar to
ASYNC_OK above, so the application would still work with Jackrabbit 2.x.

Regards,
Thomas

On 26/05/14 10:25, "Tommaso Teofili" <[email protected]> wrote:

>Hi all,
>
>I'd like to start discussing how we may improve / simplify current way of
>selecting a query engine to use for a certain query.
>
>In the QueryIndex interface we have the plain old getCost method which
>selects the index returning the lower cost for the given query but,
>recently, also an AdvancedQueryIndex interface has been introduced which,
>if I understood things correctly, uses the IndexPlan(s) returned by each
>query index for the given query to select which one has to be used.
>So I would like to discuss if it's possible to clean up things a bit in
>order to have a unified query selection mechanism.
>
>At the moment, in my opinion, one problem with the getCost() method is
>that it inherently merges the following topics:
>- index capability to handle a certain query (can the QueryIndex handle
>that query?)
>- index efficiency in handling a certain query (how fast will the
>QueryIndex be in handling that query?)
>
>Also the efficiency is not evaluated on a "cost model", each QueryIndex
>implementation can return an arbitrary different number; on one hand this
>is ok as it allows to take very index specific constraint into account: on
>the other hand if one has to write a new QueryIndex implementation he/she
>will have to look into each other query index implementation to understand
>(and design) if / when its index is picked up; and even with already
>existing indexes it's not easy to say upfront which one will be selected
>(e.g. for debugging purposes).
>
>With the AdvancedQueryIndex, if I understood it correctly (I just had a
>look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
>is supposed to address better both the cost (as it explicitly exposes the
>cost per execution, cost per entry and estimated entry count metrics) and
>the query index capability to handle a certain query (e.g. this is used
>for ordered property index).
>However, at the moment, only the OrderedPropertyIndex is using it so I
>think it'd be good to decide if we want to go further with the
>AdvancedQueryIndex also for the other QueryIndex implementations (and get
>rid for example of the FullTextQueryIndex interface as it seems useless to
>me) or not.
>
>One final question on query index selection, should we always select the
>fastest index ?
>Especially for full text ones this should be in some way configurable.
>
>What do others think?
>Regards,
>Tommaso
>
>p.s.:
>As discussed also offline last week with some other folks maybe one
>further metric to be taken into consideration for the index selection is
>if the index is synchronous or not
