[ https://issues.apache.org/jira/browse/CASSANDRA-6903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
C. Scott Andreas updated CASSANDRA-6903: ---------------------------------------- Component/s: Core > Allow a token scan to filter on partition key columns > ----------------------------------------------------- > > Key: CASSANDRA-6903 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6903 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Andy Neilson > Priority: Major > > When extracting data for analysis (e.g., in Hadoop) using a token scan, > allowing filtering on column that is part of the partition key allow for more > efficient processing. For example, assume that we have the following schema > (from the example defined > [here|http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/]): > {noformat} > CREATE TABLE temperature_by_day ( > weatherstation_id text, > date text, > event_time timestamp, > temperature text, > PRIMARY KEY ((weatherstation_id,date),event_time) > ); > {noformat} > Assume that I am primarily interested in doing analysis of more recent data, > so I can use a SELECT like the following to extract the data I am interested > in: > {noformat} > SELECT * > FROM temperature_by_day > WHERE token(weatherstation_id,date) > ? AND token(weatherstation_id,date) <= ? > AND event_time >= ? > LIMIT 5000 > ALLOW FILTERING; > {noformat} > The filtering is potentially expensive since it touches a lot of columns. > Since the {{date}} column that is used to fragment wide rows is related to > the {{event_time}}, I could apply a (redundant) filter to {{date}}, as in: > {noformat} > SELECT * > FROM temperature_by_day > WHERE token(weatherstation_id,date) > ? AND token(weatherstation_id,date) <= ? > AND event_time >= ? > AND date >= ? > LIMIT 5000 > ALLOW FILTERING; > {noformat} > ...but currently I can't add the filter on the {{date}} column because it is > part of the partition key. However, because this query is doing a token scan, > there really is no problem in filtering on the partition key. The predicate > on {{date}} can be evaluated directly on the row index without looking at the > values in columns at all. The effect is to efficiently filter out a large > swath of rows, and not forcing the scan to filter on rows that couldn't > possibly contain those dates. > There are probably lots of ways to optimize predicates on partition key > columns. For example, if the {{date}} column was made the first column in the > partition key, evaluating a range could be done without scanning the entire > row index. > In this case, if we have a year of data, but are only interested in > extracting the last day, so the overhead of filtering is reduced by a factor > of 365. > What I am looking for is: > * If the SELECT is a token scan, allow filtering on partition key columns. > * Predicates on partition key columns are evaluated on for the row as a > whole, before filtering on clustering columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org