[
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299993#comment-15299993
]
Benjamin Lerer commented on CASSANDRA-11871:
--------------------------------------------
{quote}
{code}
SELECT floor(time, 3h), count() FROM foo GROUP BY floor(time, 3h);
{code}
When is my "starting point" then?
{quote}
In this case, I would take as starting point some Jan 1 at 0:00.
{quote}
In fact, even if have a {{WHERE}}, I could have (assuming CASSANDRA-11873):
{code}
SELECT floor(time, 3h), count() FROM foo WHERE time > now() - 10h GROUP BY
floor(time, 3h);
{code}
in which case I probably don't want my intervals to be based on the value of
{{now()}}.
{quote}
In this case, my starting point would be {{now() - 10h}}.
My idea was to determine automatically the starting point. I think it should be
possible. Of course it will require some extra logic in {{SelectStatement}} to
handle it.
The main advantage is that it will simplify the query. Instead of being:
{code}
SELECT floor(time, 3h, now() - 10h), count() FROM foo WHERE time > now() - 10h
GROUP BY floor(time, 3h, now() - 10h);
{code}
the query will be:
{code}
SELECT floor(time, 3h), count() FROM foo WHERE time > now() - 10h GROUP BY
floor(time, 3h);
{code}
Otherwise, an important restriction to not forget is that we only allow
functions on the clustering columns.
> Allow to aggregate by time intervals
> ------------------------------------
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
> Issue Type: Improvement
> Components: CQL
> Reporter: Benjamin Lerer
> Assignee: Benjamin Lerer
> Fix For: 3.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
> symbol text,
> date date,
> time time,
> priceMantissa int,
> priceExponent tinyint,
> volume int,
> PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same
> order. As we can have to process a large amount of data, we want to try to
> limit ourself to the cases where we can build the groups on the flight (which
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL'
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL'
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time),
> second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will be
> much better as it does not allow for this type of mistakes and is much more
> flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date =
> '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the
> starting point. For a query like: {{SELECT floor(time, m), count() FROM
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00'
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)