Hi,

QueryIndex.getCost: this is actually quite well documented (see the Javadocs), but the implementations might not fully follow the contract :-) In any case, I think it's better to deprecate it and use AdvancedQueryIndex, as it has more features (especially important for ordered indexes). Currently, both QueryIndex and AdvancedQueryIndex are supported; in the future I hope we can switch all index implementations to AdvancedQueryIndex. However, I didn't want to do that just before the 1.0 release.
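To make the difference a bit more concrete, here is a minimal sketch of plan-based selection using the three metrics mentioned in the quoted mail below (cost per execution, cost per entry, estimated entry count). The Plan class, the index names, and the way the metrics are combined are assumptions for illustration only; this is not Oak's IndexPlan API, and the real query engine may weigh things differently.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PlanCostSketch {

    /** Hypothetical stand-in for an index plan; not Oak's IndexPlan. */
    static final class Plan {
        final String indexName;
        final double costPerExecution;
        final double costPerEntry;
        final long estimatedEntryCount;

        Plan(String indexName, double costPerExecution,
             double costPerEntry, long estimatedEntryCount) {
            this.indexName = indexName;
            this.costPerExecution = costPerExecution;
            this.costPerEntry = costPerEntry;
            this.estimatedEntryCount = estimatedEntryCount;
        }

        /** Assumed formula: fixed cost plus per-entry cost times expected entries. */
        double totalCost() {
            return costPerExecution + costPerEntry * estimatedEntryCount;
        }
    }

    /** Returns the plan with the lowest assumed total cost, if there is one. */
    static Optional<Plan> cheapest(List<Plan> plans) {
        return plans.stream().min(Comparator.comparingDouble(Plan::totalCost));
    }

    public static void main(String[] args) {
        List<Plan> plans = Arrays.asList(
                new Plan("property", 1.0, 1.0, 1000),  // total cost 1001.0
                new Plan("ordered", 10.0, 0.5, 200));  // total cost 110.0
        // Prints "ordered"
        System.out.println(cheapest(plans).map(p -> p.indexName).orElse("none"));
    }
}
```

The point is that each index reports the same three numbers, so a new implementation does not need to guess what arbitrary value the other indexes return.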

> get rid for example of the FullTextQueryIndex interface

The FullTextQueryIndex interface allows the query engine to choose a full-text index for full-text queries, if one is available, even if its cost is higher. The problem is that only full-text indexes can return the correct data if full-text constraints are used. If we want to get rid of the FullTextQueryIndex interface, we need to address this problem in some other way.

> should we always select the fastest index ? Especially for full text
> ones this should be in some way configurable.

There are multiple aspects to this: (a) "let the user decide which index to use", and (b) "synchronous versus async indexes".

(a) The user should be able to decide which index to use for a certain query. There are some problems with that: the index the user has in mind might not be available (in a certain environment, or with a later version of Oak, because for example the index implementation was replaced, or when not using Oak). Hardcoding the index to use (in the query) is problematic. Relational databases: Oracle supports "hints" (search for "oracle database hint"). SQLite supports something similar, but there it's actually an assertion that an index is available, not a hint: http://www.sqlite.org/lang_indexedby.html . My position is that we should avoid such a mechanism, and improve the query engine, the index implementations, and the documentation instead.

(b) Synchronous versus async indexes: some indexes are updated asynchronously, and therefore will not include the very latest additions to the content. (Recently removed nodes are not a problem, as the query engine has to check whether a node is available anyway.) We could let the user decide if using an asynchronous index is OK or not.

For both (a) and (b), one problem is that the JCR spec doesn't allow for extensions in the query syntax, so if the user used "select ... option async_ok", the query would not work with Jackrabbit 2.x and other JCR implementations.
Maybe we should create a JCR commons utility method, so that one could use:

    QueryManager qm = ...
    Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK);

The method JcrUtils.createQuery could then use "instanceof" to decide whether it's OK to modify the query (for Oak) or not (non-Oak).

> for full text queries for example, one may be interested in having a
> higher recall (more documents matching the query) which may eventually
> lead to a slightly slower query execution / higher cost evaluation

That would also need to be specified by the user in some way, right? For example in the query itself? We could use a mechanism similar to ASYNC_OK above, so the application would still work with Jackrabbit 2.x.

Regards,
Thomas


On 26/05/14 10:25, "Tommaso Teofili" <[email protected]> wrote:

>Hi all,
>
>I'd like to start discussing how we may improve / simplify current way of
>selecting a query engine to use for a certain query.
>
>In the QueryIndex interface we have the plain old getCost method which
>selects the index returning the lower cost for the given query but,
>recently, also an AdvancedQueryIndex interface has been introduced which,
>if I understood things correctly, uses the IndexPlan(s) returned by each
>query index for the given query to select which one has to be used.
>So I would like to discuss if it's possible to clean up things a bit in
>order to have a unified query selection mechanism.
>
>At the moment, in my opinion, one problem with the getCost() method is
>that it inherently merges the following topics:
>- index capability to handle a certain query (can the QueryIndex handle
>that query?)
>- index efficiency in handling a certain query (how fast will the
>QueryIndex will be in handling that query?)
>
>Also the efficiency is not evaluated on a "cost model", each QueryIndex
>implementation can return an arbitrary different number; on one hand this
>is ok as it allows to take very index specific constraint into account: on
>the other hand if one has to write a new QueryIndex implementation he/she
>will have to look into each other query index implementation to understand
>(and design) if / when its index is picked up; and even with already
>existing indexes it's not easy to say upfront which one will be selected
>(e.g. for debugging purposes).
>
>With the AdvancedQueryIndex, if I understood it correctly (I just had a
>look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
>is supposed to address better both the cost (as it explicitly exposes the
>cost per execution, cost per entry and estimated entry count metrics) and
>the query index capability to handle a certain query (e.g. this is used
>for ordered property index).
>However, at the moment, only the OrderedPropertyIndex is using it so I
>think it'd be good to decide if we want to go further with the
>AdvancedQueryIndex also for the other QueryIndex implementations (and get
>rid for example of the FullTextQueryIndex interface as it seems useless to
>me) or not.
>
>One final question on query index selection, should we always select the
>fastest index ?
>Especially for full text ones this should be in some way configurable.
>
>What do others think?
>Regards,
>Tommaso
>
>p.s.:
>As discussed also offline last week with some other folks maybe one
>further metric to be taken into consideration for the index selection is
>if the index is synchronous or not
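
As a rough illustration of keeping "can this index handle the query?" separate from "how expensive is it?", together with the synchronous/asynchronous question from the p.s., the sketch below first filters candidates by capability and only then compares cost. Everything here (the Candidate class, its fields, the index names, the asyncOk flag) is hypothetical and only meant to show the shape of the idea; none of it is an existing Oak API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class IndexEligibilitySketch {

    /** Hypothetical description of a candidate index; not an Oak type. */
    static final class Candidate {
        final String name;
        final boolean fullText; // can evaluate full-text constraints
        final boolean async;    // updated asynchronously, may lag behind
        final double cost;      // estimated cost for this query

        Candidate(String name, boolean fullText, boolean async, double cost) {
            this.name = name;
            this.fullText = fullText;
            this.async = async;
            this.cost = cost;
        }
    }

    /**
     * Capability first, cost second: drop indexes that cannot answer the
     * query correctly or that the caller does not accept, then pick the
     * cheapest of the remaining ones. An empty result means the caller has
     * to fall back to something else.
     */
    static Optional<Candidate> select(List<Candidate> candidates,
                                      boolean queryHasFullText,
                                      boolean asyncOk) {
        return candidates.stream()
                .filter(c -> !queryHasFullText || c.fullText)
                .filter(c -> asyncOk || !c.async)
                .min(Comparator.comparingDouble((Candidate c) -> c.cost));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("propertyIndex", false, false, 50.0),
                new Candidate("fullTextAsync", true, true, 200.0));

        // Full-text query, async accepted: the full-text index wins despite its higher cost.
        System.out.println(select(candidates, true, true).map(c -> c.name).orElse("none"));
        // Full-text query, async not accepted: nothing qualifies.
        System.out.println(select(candidates, true, false).map(c -> c.name).orElse("none"));
        // No full-text constraint: plain lowest cost among synchronous indexes.
        System.out.println(select(candidates, false, false).map(c -> c.name).orElse("none"));
    }
}
```

With this split, cost is only ever compared among indexes that can return correct results for the query, which is the part the FullTextQueryIndex marker interface covers today.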
