[jira] [Created] (CASSANDRA-6903) Allow a token scan to filter on partition key columns

Andy Neilson (JIRA) Fri, 21 Mar 2014 09:16:04 -0700

Andy Neilson created CASSANDRA-6903:
---------------------------------------


             Summary: Allow a token scan to filter on partition key columns
                 Key: CASSANDRA-6903
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6903
             Project: Cassandra
          Issue Type: Improvement
          Components: Core, Hadoop
            Reporter: Andy Neilson


When extracting data for analysis (e.g., in Hadoop) using a token scan, 
allowing filtering on column that is part of the partition key allow for more 
efficient processing. For example, assume that we have the following schema 
(from the example defined 
[here|http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/]):

{noformat}
CREATE TABLE temperature_by_day (
   weatherstation_id text,
   date text,
   event_time timestamp,
   temperature text,
   PRIMARY KEY ((weatherstation_id,date),event_time)
);
{noformat}

Assume that I am primarily interested in doing analysis of more recent data, so 
I can use a SELECT like the following to extract the data I am interested in:

{noformat}
SELECT *
FROM temperature_by_day
WHERE token(weatherstation_id,date) > ? AND token(weatherstation_id,date) <= ?
AND event_time >= ?
LIMIT 5000
ALLOW FILTERING;
{noformat}

The filtering is potentially expensive since it touches a lot of columns. Since 
the {{date}} column that is used to fragment wide rows is related to the 
{{event_time}}, I could apply a (redundant) filter to {{date}}, as in:

{noformat}
SELECT *
FROM temperature_by_day
WHERE token(weatherstation_id,date) > ? AND token(weatherstation_id,date) <= ?
AND event_time >= ?
AND date >= ?
LIMIT 5000
ALLOW FILTERING;
{noformat}

...but currently I can't add the filter on the {{date}} column because it is 
part of the partition key. However, because this query is doing a token scan, 
there really is no problem in filtering on the partition key. The predicate on 
{{date}} can be evaluated directly on the row index without looking at the 
values in columns at all. The effect is to efficiently filter out a large swath 
of rows, and not forcing the scan to filter on rows that couldn't possibly 
contain those dates.

There are probably lots of ways to optimize predicates on partition key 
columns. For example, if the {{date}} column was made the first column in the 
partition key, evaluating a range could be done without scanning the entire row 
index.

In this case, if we have a year of data, but are only interested in extracting 
the last day, so the overhead of filtering is reduced by a factor of 365. 

What I am looking for is:

* If the SELECT is a token scan, allow filtering on partition key columns.
* Predicates on partition key columns are evaluated on for the row as a whole, 
before filtering on clustering columns.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (CASSANDRA-6903) Allow a token scan to filter on partition key columns

Reply via email to