[
https://issues.apache.org/jira/browse/DRILL-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998093#comment-14998093
]
James Taylor commented on DRILL-3929:
-------------------------------------
[~amansinha100] - I like the idea of a new type of index that is maintained
asynchronously - we've had users ask for this in Phoenix too. The tricky part
is knowing when it's ok to use (since you don't know how up-to-date it is), but
I'm sure for some use cases it's good enough given some tolerable lag time.
As Jacques mentioned, we're doing some integration work so that Phoenix can
plug into Drill. Would be great if we could work together on secondary index
support. Materialized views are a good fit we think. They'll handle our
functional indexes as well which can get tricky. If Elasticsearch could be
packaged as a Calcite adapter, perhaps you could model this the same way?
As far as Phoenix indexes, there are two main types:
- Global indexes which have a separate table orthogonal to the data table.
- Local indexes which colocates the index data with the table data on the same
region server.
Both can have covered columns. The cost depends on whether all referenced
columns are contained by the index. If they are, then it's always cheaper to
use the index. Otherwise, the cost for joining back depends on the kind of
index: for local indexes it's relatively cheap because we know the lookup of
the data row is a local get. For global indexes, it's more expensive but we
have the PK columns in the index row so we can leverage our skip scan to
improve the lookup. We haven't got to the point of keeping histograms to know
selectivity based on additional filtering, but we do know how much data will be
scanned. We're looking toward Calcite to have a good way to plug in more
advanced statistics to improve our costing.
One tricky aspect is dealing with secondary index maintenance, for example,
dealing with out-of-order updates. It makes it tricky to know what the old
index value is so that you can remove it before adding the new index value.
There's another wrinkle with transactions and read-your-own-write semantics, as
multiple in-flight transactions may each have different values for the same
data row. Depending on which transaction commits first, the others will need to
get rolled back.
I'm happy to setup a hangout to discuss further. One meta point too: it'd be
great if we could do a feasibility study of having Phoenix support in the MapR
distro. The Phoenix coprocessors do not step deeply into HBase internals
(ignoring local indexing which is being reworked not to), plus the HBase team
has put slimmed down interfaces in the coprocessor APIs. I'd be happy to guide
someone along on that.
> Support the ability to query database tables using external indices
> ------------------------------------------------------------------------------
>
> Key: DRILL-3929
> URL: https://issues.apache.org/jira/browse/DRILL-3929
> Project: Apache Drill
> Issue Type: New Feature
> Components: Execution - Relational Operators, Query Planning &
> Optimization
> Reporter: Aman Sinha
> Assignee: Aman Sinha
>
> This is a placeholder for adding support in Drill to query database tables
> using external indices. I will add more details about the use case and a
> preliminary design proposal.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)