C. Scott Andreas updated CASSANDRA-6903:
    Component/s: Core

> Allow a token scan to filter on partition key columns
> -----------------------------------------------------
>                 Key: CASSANDRA-6903
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6903
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Andy Neilson
>            Priority: Major
> When extracting data for analysis (e.g., in Hadoop) using a token scan, 
> allowing filtering on column that is part of the partition key allow for more 
> efficient processing. For example, assume that we have the following schema 
> (from the example defined 
> [here|http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/]):
> {noformat}
> CREATE TABLE temperature_by_day (
>    weatherstation_id text,
>    date text,
>    event_time timestamp,
>    temperature text,
>    PRIMARY KEY ((weatherstation_id,date),event_time)
> );
> {noformat}
> Assume that I am primarily interested in doing analysis of more recent data, 
> so I can use a SELECT like the following to extract the data I am interested 
> in:
> {noformat}
> FROM temperature_by_day
> WHERE token(weatherstation_id,date) > ? AND token(weatherstation_id,date) <= ?
> AND event_time >= ?
> LIMIT 5000
> {noformat}
> The filtering is potentially expensive since it touches a lot of columns. 
> Since the {{date}} column that is used to fragment wide rows is related to 
> the {{event_time}}, I could apply a (redundant) filter to {{date}}, as in:
> {noformat}
> FROM temperature_by_day
> WHERE token(weatherstation_id,date) > ? AND token(weatherstation_id,date) <= ?
> AND event_time >= ?
> AND date >= ?
> LIMIT 5000
> {noformat}
> ...but currently I can't add the filter on the {{date}} column because it is 
> part of the partition key. However, because this query is doing a token scan, 
> there really is no problem in filtering on the partition key. The predicate 
> on {{date}} can be evaluated directly on the row index without looking at the 
> values in columns at all. The effect is to efficiently filter out a large 
> swath of rows, and not forcing the scan to filter on rows that couldn't 
> possibly contain those dates.
> There are probably lots of ways to optimize predicates on partition key 
> columns. For example, if the {{date}} column was made the first column in the 
> partition key, evaluating a range could be done without scanning the entire 
> row index.
> In this case, if we have a year of data, but are only interested in 
> extracting the last day, so the overhead of filtering is reduced by a factor 
> of 365. 
> What I am looking for is:
> * If the SELECT is a token scan, allow filtering on partition key columns.
> * Predicates on partition key columns are evaluated on for the row as a 
> whole, before filtering on clustering columns.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to