[jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals

Benjamin Lerer (JIRA) Wed, 25 May 2016 07:26:42 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300132#comment-15300132
 ]


Benjamin Lerer commented on CASSANDRA-11871:
--------------------------------------------

{quote}
I'm convinced that it's rarely actually what you want: my point with the 
{{now() - 10h}} example is that if I aggregate stuffs say every hours, in 99% 
of the case I want my intervals to be 10:00, 11:00, etc.. I don't want them to 
be 10:21, 11:21, ... or 10:59, 11:59, .... based on when my query happens to be 
executing.
{quote}
Sorry, I wrote it to quickly. If {{now() - 10h}} is {{10:21}} and the function 
is {{floor(time, 3h)}} my start time would have been {{10:00}}. If the function 
is {{floor(time, 5m)}} my start time would have been {{10:21}}. 
Nevertheless, I agree that it can produce some weird ranges.

{quote}
But anyway, whatever way we use to pick the starting point, I'm also slightly 
wary of the fact that simple arithmetic on dates doesn't always work too well, 
so I wonder if that's a problem here (concretely, if you want 3h intervals 
starting from a long time ago, the correct intervals for today should 
theoretically depend at least on the number of leap years in between).
{quote}
It is clear that we will have to be carefull.

> Allow to aggregate by time intervals
> ------------------------------------
>
>                 Key: CASSANDRA-11871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
>     symbol text,
>     date date,
>     time time,
>     priceMantissa int,
>     priceExponent tinyint,
>     volume int,
>     PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = 
> '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals

Reply via email to