[ 
https://issues.apache.org/jira/browse/DRILL-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998093#comment-14998093
 ] 

James Taylor commented on DRILL-3929:
-------------------------------------

[~amansinha100] - I like the idea of a new type of index that is maintained 
asynchronously - we've had users ask for this in Phoenix too. The tricky part 
is knowing when it's ok to use (since you don't know how up-to-date it is), but 
I'm sure for some use cases it's good enough given some tolerable lag time.

As Jacques mentioned, we're doing some integration work so that Phoenix can 
plug into Drill. Would be great if we could work together on secondary index 
support. Materialized views are a good fit we think. They'll handle our 
functional indexes as well which can get tricky. If Elasticsearch could be 
packaged as a Calcite adapter, perhaps you could model this the same way?

As far as Phoenix indexes, there are two main types:
- Global indexes which have a separate table orthogonal to the data table.
- Local indexes which colocates the index data with the table data on the same 
region server.

Both can have covered columns. The cost depends on whether all referenced 
columns are contained by the index. If they are, then it's always cheaper to 
use the index. Otherwise, the cost for joining back depends on the kind of 
index: for local indexes it's relatively cheap because we know the lookup of 
the data row is a local get. For global indexes, it's more expensive but we 
have the PK columns in the index row so we can leverage our skip scan to 
improve the lookup. We haven't got to the point of keeping histograms to know 
selectivity based on additional filtering, but we do know how much data will be 
scanned. We're looking toward Calcite to have a good way to plug in more 
advanced statistics to improve our costing.

One tricky aspect is dealing with secondary index maintenance, for example, 
dealing with out-of-order updates. It makes it tricky to know what the old 
index value is so that you can remove it before adding the new index value. 
There's another wrinkle with transactions and read-your-own-write semantics, as 
multiple in-flight transactions may each have different values for the same 
data row. Depending on which transaction commits first, the others will need to 
get rolled back.

I'm happy to setup a hangout to discuss further. One meta point too: it'd be 
great if we could do a feasibility study of having Phoenix support in the MapR 
distro. The Phoenix coprocessors do not step deeply into HBase internals 
(ignoring local indexing which is being reworked not to), plus the HBase team 
has put slimmed down interfaces in the coprocessor APIs. I'd be happy to guide 
someone along on that.


> Support the ability to query database tables using external indices           
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-3929
>                 URL: https://issues.apache.org/jira/browse/DRILL-3929
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Execution - Relational Operators, Query Planning & 
> Optimization
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>
> This is a placeholder for adding support in Drill to query database tables 
> using external indices.  I will add more details about the use case and a 
> preliminary design proposal.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to