[
https://issues.apache.org/jira/browse/KUDU-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824692#comment-15824692
]
Todd Lipcon commented on KUDU-1291:
-----------------------------------
We don't currently have any cardinality estimate for columns, but it's a useful
thing to be able to collect as input into optimizations like this.
A couple options come to mind:
- if the column is already dictionary-coded, then within the context of a
single DiskRowSet, this optimization/transformation is pretty easy to
accomplish. We already have the unique list of values in the dictionary block.
The downside is that we only support dictionary encoding for string columns.
- we could do something to compute approximate cardinalities for columns in a
general case (e.g. theta sketches), which also has some applications for query
planners, etc. But, that only gives us the cardinalities and not the distinct
list of values. That isn't necessarily problematic -- maybe we don't need to
treat this as an "up-front" optimization but rather could do it more
on-the-fly? We'd need to discuss the design of this a bit, and maybe do some
prototypes, to know the best way to go about it.
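To make the dictionary-based option a bit more concrete, here is a rough sketch of how the distinct values of a leading column could turn a predicate on a later key column into a disjunction of key ranges. This is not Kudu code: the (hostname, timestamp) key layout, the in-memory sorted list standing in for a DiskRowSet, and the function names are all assumptions made for illustration.

```python
import bisect
from typing import List, Tuple

Key = Tuple[str, int]          # hypothetical (hostname, timestamp) primary key
KeyRange = Tuple[Key, Key]     # inclusive (lower, upper) bounds


def expand_to_key_ranges(leading_values: List[str],
                         ts_lo: int, ts_hi: int) -> List[KeyRange]:
    """Build one inclusive key range per distinct leading value.

    `leading_values` stands in for the unique values found in a
    rowset's dictionary block (an assumption of this sketch).
    """
    return [((v, ts_lo), (v, ts_hi)) for v in sorted(set(leading_values))]


def scan_with_ranges(sorted_keys: List[Key],
                     ranges: List[KeyRange]) -> List[Key]:
    """Seek to each range in turn instead of scanning every row."""
    out: List[Key] = []
    for lo, hi in ranges:
        start = bisect.bisect_left(sorted_keys, lo)
        end = bisect.bisect_right(sorted_keys, hi)
        out.extend(sorted_keys[start:end])
    return out


# Toy rowset, sorted by primary key as a rowset would be.
rowset = sorted([("a", 1), ("a", 5), ("a", 9), ("b", 2), ("b", 6), ("c", 7)])
ranges = expand_to_key_ranges(["a", "b", "c"], ts_lo=5, ts_hi=7)
matches = scan_with_ranges(rowset, ranges)
```

The number of seeks grows with the number of distinct leading values, which is why this only pays off when the leading column is low cardinality within the rowset.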
One specific case where this is particularly easy is when the leading column(s)
have cardinality = 1 within a particular rowset. This is easy to identify since
we already have the min/max row key of a rowset. In that particular case, we
could trivially add the 'leading_column = <...>' predicate and not have to
worry about the disjunctions or new "skip" code paths. This is probably
effective for some of the cases mentioned such as leading year/month in a large
dataset.
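A minimal sketch of that check, assuming row keys can be viewed as tuples of column values that sort lexicographically. The function name and the tuple representation are hypothetical and do not reflect how Kudu encodes keys.

```python
from typing import Any, Dict, Sequence


def implied_equality_predicates(key_columns: Sequence[str],
                                min_key: Sequence[Any],
                                max_key: Sequence[Any]) -> Dict[str, Any]:
    """Return 'column = value' predicates implied by a rowset's key bounds.

    Because keys sort lexicographically, every row between min_key and
    max_key shares whatever leading prefix the two bounds have in common.
    We must stop at the first column where the bounds differ: later
    columns may coincide in the bounds without being constant in between.
    """
    preds: Dict[str, Any] = {}
    for name, lo, hi in zip(key_columns, min_key, max_key):
        if lo != hi:
            break
        preds[name] = lo
    return preds


cols = ("year", "month", "day", "entity_id")
# Rowset covering a single month: year and month have cardinality 1.
single_month = implied_equality_predicates(
    cols, (2016, 1, 1, 5), (2016, 1, 31, 2))
# Rowset spanning two years: nothing can be inferred, even though the
# month component happens to match in both bounds.
two_years = implied_equality_predicates(
    cols, (2015, 1, 3, 0), (2016, 1, 4, 0))
```

With the implied predicates added, a scan with only a predicate on a later key column can be handled as an ordinary prefix-bounded scan for that rowset.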
> Efficiently support predicates on non-prefix key components
> -----------------------------------------------------------
>
> Key: KUDU-1291
> URL: https://issues.apache.org/jira/browse/KUDU-1291
> Project: Kudu
> Issue Type: Sub-task
> Components: perf, tablet
> Reporter: Todd Lipcon
>
> In a lot of workloads, users have a compound primary key where the first
> component (or few components) is low cardinality. For example, a time series
> workload may have (year, month, day, entity_id, timestamp) as a primary key.
> A metrics or log storage workload might have (hostname, timestamp).
> It's common to want to do cross-user or cross-date analytics like 'WHERE
> timestamp BETWEEN <a> and <b>' without specifying any predicate for the first
> column(s) of the PK. Currently, we do not execute this efficiently, but
> rather scan the whole table evaluating the predicate.