[jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals

Sylvain Lebresne (JIRA) Wed, 25 May 2016 01:57:59 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299700#comment-15299700
 ]


Sylvain Lebresne commented on CASSANDRA-11871:
----------------------------------------------

I would truly hate to limit this to a single time-related function, so I 
*really* think we should re-phrase this ticket as adding support for functions 
(as generally as possible, see below) in {{GROUP BY}}.

In practice, there is obviously limitations to what functions we can support in 
{{GROUP BY}}. Informally, we need the {{GROUP BY}} to "respect" the clustering 
order even when functions are applied. And I believe that for that, we can use 
the following sufficient and reasonably simple to check conditions:
# We only allow functions on a single column. That is, {{GROUP BY a, f(b)}} 
might be ok (if the other conditions hold), but {{GROUP BY f(a, b)}} won’t be.
# If you ignore the function calls, the {{GROUP BY}} should still be valid (in 
the sense of CASSANDRA-10707). That is, {{GROUP BY f(a), g(b)}} would only be 
valid if {{GROUP BY a, b}} is.
# We only allow functions in {{GROUP BY}} that are monotonic.

I'll note that those conditions are _sufficient_ (if I'm not missing something) 
but they may not be _necessary_ (in other words, there is cases that will be 
rejected by those conditions but coudl still "respect" the cluster order). This 
is ok though because those rules are general enough to be useful imo. The main 
point is that they are simple to check.

On the 3rd point though, validating that a function is monotic automatically is 
not at all easy and I'm not suggesting we do so. What I am suggesting is that 
we simply manually mark functions that are know to be monotonic (very 
concretely, we'll add a {{isMonotonic}} method to the {{Function}} class). And 
initially, we'll only mark "native" functions as such, UDF will also have this 
method return {{false}}, though we could in the future allow users to declare 
their UDF as monotonic if we so wish.

Of course, I fully agree that we also want to add a {{floor}}-like function as 
suggested above.

bq. An important aspect to keep in mind with a function like floor is the 
starting point.

I'm not exactly sure how that would work to be honest. First because, if I 
understand what you're saying, it means the result of the floor method depends 
on what is in the {{WHERE}} clause, and not only does that sound like a mess to 
implement, but how would that work if I do:
{noformat}
SELECT floor(time, 3h), count() FROM foo GROUP BY floor(time, 3h);
{noformat}
When is my "starting point" then? Unless you mean that the {{floor}} function 
has some per-partition state and the starting point is based on the first value 
it receives, but that's even worth because that would mean the returning 
intervals are not predictable.

In fact, even if have a {{WHERE}}, I could have (assuming CASSANDRA-11873):
{noformat}
SELECT floor(time, 3h), count() FROM foo WHERE time > now() - 10h GROUP BY 
floor(time, 3h);
{noformat}
in which case I probably don't want my intervals to be based on the value of 
{{now()}}.

Am I missing something obvious?

Anyway, I really think the {{floor}} function should be a normal function, that 
only depends on its arguments. I do like the flexibility of being able to 
define precise intervals, and I suppose we should provide the starting point as 
argument of the method. Meaning that general form of the function would be 
{{floor(value, range, starting_point)}} and would return the biggest time {{t}} 
such that {{t < value}} and such that {{t = starting_point + n * range}} for 
some integer {{n}}. Assuming we can provide that, we could just also have 
{{floor(value, range)}} as just a shortcut using a hard-coded 
{{starting_point}} (some Jan 1 at 0:00). But that said, I'm not sure such 
method can be computed efficiently, so not sure how viable that is.


> Allow to aggregate by time intervals
> ------------------------------------
>
>                 Key: CASSANDRA-11871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
>     symbol text,
>     date date,
>     time time,
>     priceMantissa int,
>     priceExponent tinyint,
>     volume int,
>     PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = 
> '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals

Reply via email to