[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

Anton Slutsky (JIRA) Mon, 16 Feb 2015 22:33:35 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323716#comment-14323716
 ]


Anton Slutsky commented on CASSANDRA-4914:
------------------------------------------

I think it may not be all that complicated, at least in some cases.  If we 
consider the avg function, for example, any record in the resultset of interest 
has a non-zero probability of being exactly the average value, kind of by 
definition :-), and nothing prevents us from grabbing the very first record and 
looking at it from that point of view.  The key here is, of course, to figure 
out what that non-zero probability is, but that can also be approximated with 
some accuracy by sampling a little bit beyond the first record.  If we are 
smart about how we sample and if we have an idea of how big the actual 
resultset is, a reasonably close approximation of the average value can be 
achieved, and the probability of it being the true average can be computed with 
common techniques.  Along the same lines, "sum" can be thought of as an integral 
over the shape approximated by the avg, which can also be approximated with 
some probability of being correct.
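
To make the idea concrete, here is a rough client-side sketch in Python. It is
not anything Cassandra provides. It assumes the sample was drawn at random, that
the total row count is known, and that a normal approximation is acceptable;
the function name and parameters are made up for illustration. "sum" is read
here simply as the estimated avg times the row count.

```python
import math
import statistics


def approximate_avg_and_sum(sample, total_rows, z=1.96):
    """Estimate avg and sum of a column from a random sample of its values.

    Hypothetical helper, not part of any Cassandra driver.
    total_rows: assumed-known size of the full resultset.
    z: normal quantile for the confidence level (1.96 is roughly 95%).
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled values")
    if n > total_rows:
        raise ValueError("sample cannot be larger than the resultset")

    mean = statistics.fmean(sample)

    # Standard error of the mean, with a finite population correction
    # because the sample is drawn without replacement from a resultset
    # of known size. The correction is 0 when every row was sampled.
    fpc = math.sqrt((total_rows - n) / (total_rows - 1))
    stderr = statistics.stdev(sample) / math.sqrt(n) * fpc
    margin = z * stderr

    return {
        "avg": mean,
        "avg_margin": margin,               # avg is within +/- this, approx.
        "sum": mean * total_rows,           # sum estimated as avg * row count
        "sum_margin": margin * total_rows,
    }


# Usage: two sampled salaries out of an assumed 100 rows.
estimate = approximate_avg_and_sum([10.0, 20.0], total_rows=100)
print(estimate["avg"], estimate["sum"])
```

The margin shrinks as the sample grows and drops to zero once the whole
resultset has been read, at which point this is just an exact avg and sum.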

Of course, there are many problems with the above from the statistical point of 
view.  For one, resultsets are often ordered in some way, so records taken from 
the front of a resultset cannot be assumed to be a random sample, which would 
bias the estimate.
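
A small illustration of the ordering problem, again only a sketch with made-up
data: the "resultset" below is a hypothetical list of values sorted in
descending order, and the helper names are invented for this example.

```python
import random
import statistics


def prefix_sample(rows, n):
    """Take the first n rows, as a reader of an ordered resultset would."""
    return rows[:n]


def random_sample(rows, n, seed=0):
    """Take n rows uniformly at random (seeded for repeatability)."""
    return random.Random(seed).sample(rows, n)


# Hypothetical resultset: the values 1..1000, ordered DESC.
rows = sorted(range(1, 1001), reverse=True)

true_avg = statistics.fmean(rows)
prefix_avg = statistics.fmean(prefix_sample(rows, 50))
random_avg = statistics.fmean(random_sample(rows, 50))

print("true avg:  ", true_avg)
print("prefix avg:", prefix_avg)   # only sees the largest values
print("random avg:", random_avg)
```

The prefix only ever sees the 50 largest values, so its average lands far above
the true one no matter how the confidence interval is computed.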

Anyway, I don't know if this is the right use case, but I really need aggregate 
functions for what I'm trying to do, and right now I have to fire up a Hadoop 
cluster to get simple aggregates computed, which is a major pain and takes 
forever.

I'll give it a shot in my own code and see if I can come up with a reasonable 
approach.  Perhaps others will see this discussion and suggest some ideas.

Thanks,
Anton

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (wide rows of column 
> values of int, double, float, etc.), with some basic aggregate functions like 
> AVG, SUM, Mean, Min, Max, etc. (for the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
