[ https://issues.apache.org/jira/browse/KUDU-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824692#comment-15824692 ]

Todd Lipcon commented on KUDU-1291:
-----------------------------------

We don't currently have any cardinality estimate for columns, but it's a useful 
thing to be able to collect as input into optimizations like this.

A couple options come to mind:
- if the column is already dictionary-coded, then within the context of a 
single DiskRowSet, this optimization/transformation is pretty easy to 
accomplish. We already have the unique list of values in the dictionary block. 
The downside is that we only support dictionary encoding for string columns.
- we could compute approximate cardinalities for columns in the general case 
(e.g., theta sketches), which also has applications for query planners, etc. 
But that only gives us the cardinalities, not the distinct list of values. 
That isn't necessarily problematic -- maybe we don't need to treat this as an 
"up-front" optimization but could instead do it more on-the-fly? We'd need to 
discuss the design of this a bit, and maybe do some prototypes, to know the 
best way to go about it.
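To make the theta-sketch option concrete, here is a minimal k-minimum-values (KMV) sketch in Python -- one simple member of the theta-sketch family. This is just an illustrative sketch, not Kudu code; the function name and parameters are hypothetical:

```python
import hashlib

def kmv_estimate(values, k=256):
    """Approximate distinct count with a k-minimum-values (KMV) sketch:
    hash every value to a uniform point in [0, 1) and keep only the k
    smallest points; their density estimates the true cardinality."""
    hashes = set()
    for v in values:
        # Use the first 8 bytes of SHA-1 as a uniform 64-bit hash.
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], 'big')
        hashes.add(h / 2**64)
    if len(hashes) <= k:
        return len(hashes)            # small cardinalities come out exact
    kth = sorted(hashes)[k - 1]       # k-th smallest normalized hash
    return int((k - 1) / kth)         # unbiased density-based estimate
```

Such a sketch could be maintained cheaply per column during flushes/compactions, but as noted it yields only a count, not the value list itself.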

One specific case where this is particularly easy is when the leading column(s) 
have cardinality = 1 within a particular rowset. This is easy to identify since 
we already have the min/max row key of a rowset. In that particular case, we 
could trivially add the 'leading_column = <...>' predicate and not have to 
worry about the disjunctions or new "skip" code paths. This is probably 
effective for some of the cases mentioned such as leading year/month in a large 
dataset.
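
A rough sketch of that cardinality-1 check, assuming keys are modeled as tuples of components (the column naming below is hypothetical, not Kudu's):

```python
def implied_leading_predicates(min_key, max_key):
    """If the leading component(s) of a rowset's min and max row keys are
    equal, every row in the rowset shares those values, so an equality
    predicate on them can be implied for free."""
    implied = {}
    for i, (lo, hi) in enumerate(zip(min_key, max_key)):
        if lo != hi:
            break                       # first differing component ends the run
        implied[f'key_col_{i}'] = lo    # hypothetical column naming
    return implied
```

For a rowset whose keys span (2016, 1, 5) to (2016, 1, 9), this would imply equality predicates on the first two key columns.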

> Efficiently support predicates on non-prefix key components
> -----------------------------------------------------------
>
>                 Key: KUDU-1291
>                 URL: https://issues.apache.org/jira/browse/KUDU-1291
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: perf, tablet
>            Reporter: Todd Lipcon
>
> In a lot of workloads, users have a compound primary key where the first 
> component (or few components) is low cardinality. For example, a time series 
> workload may have (year, month, day, entity_id, timestamp) as a primary key. 
> A metrics or log storage workload might have (hostname, timestamp).
> It's common to want to do cross-user or cross-date analytics like 'WHERE 
> timestamp BETWEEN <a> and <b>' without specifying any predicate for the first 
> column(s) of the PK. Currently, we do not execute this efficiently, but 
> rather scan the whole table evaluating the predicate.
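
The transformation described above -- turning a suffix predicate into one bounded seek per distinct leading-key value -- can be sketched as follows, again modeling keys as tuples (names are illustrative only):

```python
def skip_scan_ranges(distinct_prefixes, suffix_lo, suffix_hi):
    """Decompose 'suffix BETWEEN lo AND hi' (with no predicate on the
    leading key column) into one bounded key range per distinct leading
    value -- the disjunction of ranges a skip scan would seek through."""
    return [((p, suffix_lo), (p, suffix_hi)) for p in sorted(distinct_prefixes)]
```

For a (hostname, timestamp) key with two known hostnames, a timestamp range predicate becomes two short range scans instead of one full-table scan.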



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
