[
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323716#comment-14323716
]
Anton Slutsky commented on CASSANDRA-4914:
------------------------------------------
I think it may not be all that complicated, at least in some cases. If we
consider the avg function for example, any record in the resultset of interest
has a non-zero probability of being exactly the average value, kind of by
definition :-), and nothing prevents us from grabbing the very first record and
looking at it from that point of view. The key here is, of course, to figure
out what that non-zero probability is, but that can also be approximated with
some accuracy by sampling a little bit beyond the first record. If we are
smart about how we sample and if we have an idea as to how big the actual
resultset is, a reasonably close approximation of the average value can be
achieved and the probability of it being the true average can be computed with
common techniques. Along the same lines, "sum" can be thought of as an integral
over the shape approximated by the avg, which can also be approximated with
some probability of being correct.
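
A minimal sketch of the idea above, under stated assumptions: the sample is
drawn uniformly at random without replacement, the resultset size
(total_rows) is known or estimated separately, and a normal-approximation
confidence interval stands in for the "common techniques" mentioned. The
function name, parameters, and synthetic salary data are hypothetical and not
part of Cassandra or any patch on this ticket.

```python
import math
import random
import statistics


def estimate_avg_and_sum(sample, total_rows, z=1.96):
    """Estimate avg and sum of a column from a simple random sample.

    Assumption: `sample` was drawn uniformly at random without replacement
    from a resultset of `total_rows` rows. z=1.96 gives a roughly 95%
    interval under the normal approximation.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled values")
    mean = statistics.fmean(sample)
    # Finite population correction: the error shrinks to zero as the sample
    # approaches the full resultset.
    fpc = math.sqrt((total_rows - n) / (total_rows - 1)) if total_rows > 1 else 0.0
    # Standard error of the sample mean.
    stderr = statistics.stdev(sample) / math.sqrt(n) * fpc
    margin = z * stderr
    return {
        "avg": mean,
        "avg_margin": margin,
        # sum is the estimated average scaled by the resultset size.
        "sum": mean * total_rows,
        "sum_margin": margin * total_rows,
    }


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data standing in for a salary column.
    salaries = [rng.uniform(10.0, 1000.0) for _ in range(100_000)]
    sample = rng.sample(salaries, 1_000)
    print(estimate_avg_and_sum(sample, len(salaries)))
```

The margins are only as good as the randomness of the sample and the
accuracy of total_rows; both are assumptions here.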
Of course, there are many problems with the above from the statistical point of
view. For one, resultsets are often ordered in some way, so sampling cannot be
assumed to be random, which is not good.
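
To illustrate the ordering problem, here is a small sketch on synthetic,
already-sorted data (the helper name and the data are hypothetical). It
compares the average of the first k records with the average of a uniform
random sample of the same size.

```python
import random
import statistics


def prefix_vs_random(values, k, seed=0):
    """Return (true_avg, prefix_avg, random_avg) for a sample of size k."""
    true_avg = statistics.fmean(values)
    # "Grab the first k records": biased when values arrive in sorted order.
    prefix_avg = statistics.fmean(values[:k])
    # Uniform random sample of the same size, for comparison.
    random_avg = statistics.fmean(random.Random(seed).sample(values, k))
    return true_avg, prefix_avg, random_avg


if __name__ == "__main__":
    ordered = [float(v) for v in range(1000)]  # synthetic, already sorted
    true_avg, prefix_avg, random_avg = prefix_vs_random(ordered, 100)
    print(f"true={true_avg} prefix={prefix_avg} random={random_avg:.1f}")
```

On sorted input the prefix average sits far from the true average, while the
random sample lands much closer, which is why sampling only the head of an
ordered resultset cannot be trusted.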
Anyway, I don't know if this is the right use case, but I really need aggregate
functions for what I'm trying to do and right now I have to fire up a Hadoop
cluster to get simple aggregates computed, which is a major pain and takes
forever.
I'll give it a shot in my own code and see if I can come up with a reasonable
approach. Perhaps others will see this discussion and suggest some ideas.
Thanks,
Anton
> Aggregation functions in CQL
> ----------------------------
>
> Key: CASSANDRA-4914
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Vijay
> Assignee: Benjamin Lerer
> Labels: cql, docs
> Fix For: 3.0
>
> Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt,
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (Wide row of column
> values of int, double, float etc).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc (for
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>
> empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
> 130 | 3 | joe | doe | 10.1
> 130 | 2 | joe | doe | 100
> 130 | 1 | joe | doe | 1e+03
>
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>
> sum(salary) | empid
> -------------+--------
> 1110.1 | 130