[ https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324408#comment-14324408 ]
Tyler Hobbs commented on CASSANDRA-4914:
----------------------------------------
bq. I don't know the internals but it should be doable to push the aggregation
function to the partitions without requiring the data interface to understand
CQL.
The problem with pushing aggregate calculation down to the replicas is that
there's no conflict resolution. So the aggregation can be computed over stale
or deleted data. That may be acceptable if you're reading at consistency level
ONE, but then we're dealing with a limited, special case.
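
To illustrate the conflict-resolution point, here is a minimal sketch in Python. It is a toy model, not Cassandra's read path: the replica dictionaries, the `reconcile` and `sum_live` helpers, and the (timestamp, value) cell layout are all assumptions made for the example, with `None` standing in for a tombstone. It compares a sum computed after last-write-wins reconciliation with sums computed independently on each replica.

```python
# Toy model (assumption): each replica maps a key to (write_timestamp, value).
# A value of None stands in for a tombstone (deleted cell).

def reconcile(replicas):
    """Merge replicas cell by cell, keeping the newest write for each key."""
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged

def sum_live(cells):
    """Sum the values of all cells that are not tombstones."""
    return sum(value for _, value in cells.values() if value is not None)

# Replica A has the newest write for dept1 but missed two later mutations.
replica_a = {"dept1": (2, 1500), "dept2": (1, 100), "dept3": (1, 10)}
# Replica B has a newer dept2 value and a delete for dept3, but a stale dept1.
replica_b = {"dept1": (1, 1000), "dept2": (2, 200), "dept3": (3, None)}

coordinator_sum = sum_live(reconcile([replica_a, replica_b]))  # 1500 + 200
replica_a_sum = sum_live(replica_a)  # includes stale dept2 and deleted dept3
replica_b_sum = sum_live(replica_b)  # includes stale dept1

print(coordinator_sum, replica_a_sum, replica_b_sum)
```

In this model neither per-replica sum matches the reconciled result, and the two partial sums alone carry no information that would let a coordinator recover it.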
bq. Note that all agg functions are eminently parallelizable
I don't believe this is true. Off the top of my head, computing the median of
a dataset is not really parallelizable (without some sort of internode
communication).
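
A small Python sketch of the median point, using made-up partition contents (the `partitions` data is purely illustrative): partial sums combine into the exact total, but combining per-partition medians does not give the median of the whole dataset.

```python
from statistics import median

# Illustrative data (assumption): two partitions of unequal size.
partitions = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 100]]
all_values = [v for part in partitions for v in part]

# SUM decomposes: the sum of per-partition sums equals the global sum.
total_from_partials = sum(sum(part) for part in partitions)

# MEDIAN does not: the median of per-partition medians is a different number.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total_from_partials, sum(all_values))  # both are the same total
print(median_of_medians, true_median)        # these differ
```

Computing the exact median from the partitions requires more than one scalar per partition, for example shipping the values themselves or running several rounds of communication.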
bq. dealing with consistency is tricky but then Cassandra is by design
eventually consistent so why not have eventually consistent aggregations. Just
pick a partition and aggregate on that. With large datasets an average
differing at the sixth decimal won't really matter.
That may be acceptable for aggregates like average, but other aggregates may
require precision.
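
As a rough illustration in Python, assuming a simple every-tenth-value sample over synthetic data (neither the data nor the sampling scheme comes from this ticket): the sample average lands close to the true average, while SUM and MAX computed over the same sample are simply wrong.

```python
# Synthetic data (assumption): the integers 1..1000.
values = list(range(1, 1001))
# Hypothetical sampling scheme: keep every 10th value (1, 11, ..., 991).
sample = values[::10]

true_avg = sum(values) / len(values)
sample_avg = sum(sample) / len(sample)

true_sum, sample_sum = sum(values), sum(sample)
true_max, sample_max = max(values), max(sample)

print(true_avg, sample_avg)   # close to each other
print(true_sum, sample_sum)   # off by roughly the sampling factor
print(true_max, sample_max)   # the sample can miss the extreme value
```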
With all of that said, I wouldn't necessarily be opposed to supporting
selecting a sampling of data from a table (and allowing an aggregate to be run
over that), but I suggest opening a new ticket for that discussion.
> Aggregation functions in CQL
> ----------------------------
>
> Key: CASSANDRA-4914
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Vijay
> Assignee: Benjamin Lerer
> Labels: cql, docs
> Fix For: 3.0
>
> Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt,
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to aggregate data in Cassandra (wide rows of column
> values of type int, double, float, etc.),
> with some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc. (for
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |        joe |       doe |   10.1
>    130 |      2 |        joe |       doe |    100
>    130 |      1 |        joe |       doe |  1e+03
>
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>
>  sum(salary) | empid
> -------------+--------
>       1110.1 |   130