[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323669#comment-14323669
 ] 

Anton Slutsky commented on CASSANDRA-4914:
------------------------------------------

Hello all,

I noticed that some of the aggregate functions discussed on this thread made it 
into the trunk.  I'm a little concerned about the implementation.  It looks like 
aggregates such as sum and avg are computed by looping through the result set 
pages and accumulating the value in code.  Since Cassandra is meant for large 
volumes of data, I'm worried that this is not feasible for real-world cases.  
I tried avg on a fairly sizable dataset and observed two things: first, my 
select statement timed out even with a raised read timeout setting; second, 
the CPU running the average computation was quite busy.
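
To illustrate the pattern I mean, here is a minimal Python sketch.  It is not 
the actual Cassandra code; fetch_pages and paged_avg are hypothetical names, 
and the in-memory list only stands in for a paged read.  The point is that 
every row has to pass through a single loop, so the cost grows linearly with 
the size of the result set.

```python
from typing import Iterable, Iterator, List


def fetch_pages(rows: List[float], page_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for a paged read: yields one page at a time."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def paged_avg(pages: Iterable[List[float]]) -> float:
    """Compute an average by visiting every value on every page."""
    total = 0.0
    count = 0
    for page in pages:
        for value in page:
            total += value
            count += 1
    if count == 0:
        raise ValueError("cannot average an empty result set")
    return total / count


if __name__ == "__main__":
    # Salaries from the example in the issue description.
    salaries = [10.1, 100.0, 1000.0]
    print(paged_avg(fetch_pages(salaries, page_size=2)))
```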

Obviously, there's only so much that can be done to compute these aggregates 
without resorting to some sort of distributed computation framework, but I'd 
like to suggest a slightly different approach.  I wonder if we can rethink how 
we approach aggregate functions in the context of large data.  Perhaps we 
could consider probabilistic aggregates instead of exactly computed ones?  
That is, instead of striving to compute an aggregate over an entire result 
set, maybe we can compute an estimate together with a stated probability of 
that estimate being correct.

For example:

select probabilistic_avg(my_col) from my_table;

would return something like a map:

{"avg":101.1, "prob":0.78}

where "avg" is our probabilistic avg and "prob" is the probability of it being 
what we say it is.

Of course, that won't be as good as the real thing, but I think it still has 
value in many cases.  And it can be implemented in a scalable way with some 
scratch system tables.
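
As a rough sketch of what such an estimate could look like, here is one 
possible reading in Python.  Everything in it is an assumption for 
illustration: it uses plain random sampling rather than scratch system tables, 
it interprets "prob" as the approximate probability (normal approximation, no 
finite-population correction) that the true average lies within a 
caller-chosen tolerance of the estimate, and the sample_size, tolerance and 
seed parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev
from typing import Dict, Sequence


def probabilistic_avg(values: Sequence[float], sample_size: int,
                      tolerance: float, seed: int = 0) -> Dict[str, float]:
    """Estimate the average from a random sample.

    "prob" approximates the probability that the true average lies within
    +/- tolerance of "avg", using a normal approximation of the sample mean.
    """
    n = min(sample_size, len(values))
    if n < 2:
        raise ValueError("need at least two values to estimate the spread")
    rng = random.Random(seed)
    sample = rng.sample(values, n)  # sampling without replacement
    avg = mean(sample)
    std_err = stdev(sample) / math.sqrt(n)
    if std_err == 0.0:
        prob = 1.0  # no observed spread in the sample
    else:
        z = tolerance / std_err
        prob = 2.0 * NormalDist().cdf(z) - 1.0
    return {"avg": avg, "prob": prob}


if __name__ == "__main__":
    data = [float(i % 100) for i in range(10_000)]
    print(probabilistic_avg(data, sample_size=500, tolerance=5.0))
```

The cost here depends on the sample size rather than on the size of the table, 
which is the property I'm after; how to draw a fair sample inside Cassandra is 
a separate question that this sketch does not answer.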

I'm happy to take a stab at it if this is of interest to anyone.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (Wide row of column 
> values of int, double, float etc).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc (for 
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;                  
>                   
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);                      
>               
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
