[
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326012#comment-14326012
]
Benedict commented on CASSANDRA-4914:
-------------------------------------
I'm with Cristian here, as I suggested at the NGCC last year. If we want
efficient aggregations, they should absolutely be performed at the replicas. I
realise we're not aiming for that first time around, but IMO it should be the
long term goal. Shipping all of your data over the wire is a pretty significant
cost and bottleneck, making the current implementation more of a convenience
than an analytic tool.
There are a few ways to perform conflict resolution. Probably the best is
to first let the user specify whether they care (CL=ONE is not exactly an
uncommon use case; last I heard we reckon 30% of deployments use it, and for
analytics queries in particular slight staleness may not be important) and, if
they do, perform a repair-aware read from each neighbour to ensure the replica
is up to date. Or calculate the result optimistically, along with a checksum,
and perform the repair if either doesn't match. Or select your strategy based
on whether the data has been updated recently (say, in the last few minutes):
if it has, be pessimistic; otherwise be optimistic. This is largely what
[~tjake]'s Repair Aware Consistency Levels (CASSANDRA-7168) is about.
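To make the options above concrete, here is a minimal sketch of the
recency-based choice combined with the optimistic checksum comparison.
Everything in it is hypothetical (the `Replica` type, the
`RECENT_WINDOW_SECONDS` threshold, SHA-256 as the checksum); it is not how
Cassandra or CASSANDRA-7168 implements any of this.

```python
import hashlib
from dataclasses import dataclass

RECENT_WINDOW_SECONDS = 300  # assumed threshold for "last few minutes"


@dataclass
class Replica:
    """Hypothetical replica: its local rows and the time of its last write."""
    rows: list
    last_update: float

    def aggregate_with_checksum(self):
        # Local sum plus a digest of the rows it was computed from.
        digest = hashlib.sha256(repr(sorted(self.rows)).encode()).hexdigest()
        return sum(self.rows), digest


def choose_strategy(replicas, now):
    """Pessimistic if any replica was written to recently, else optimistic."""
    recent = any(now - r.last_update < RECENT_WINDOW_SECONDS for r in replicas)
    return "pessimistic" if recent else "optimistic"


def optimistic_sum(replicas):
    """Return (result, needs_repair); repair is needed if replicas disagree."""
    answers = {r.aggregate_with_checksum() for r in replicas}
    if len(answers) == 1:
        result, _ = next(iter(answers))
        return result, False
    return None, True
```

In this sketch the coordinator only receives one number and one digest per
replica, and only falls back to the expensive repair path on a mismatch.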
Generally, analytics queries are intended to be run over large, _mostly_
static datasets, so the computation should be optimised for this IMO. There is
of course the complication of supporting deterministic aggregations over
multiple partitions, which would probably have to fall back to
coordinator-level aggregation for operations that cannot be trivially composed
exactly (e.g. median), but most aggregations can be composed from partial
computations trivially.
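As a rough illustration of that last point (a sketch only, not Cassandra code;
the names `Partial`, `partial_of` and `merge` are made up for this example),
each replica can reduce its rows to a small partial state and the coordinator
merges the states. Sum, count, min, max and average compose exactly this way;
median does not, since it needs the values themselves.

```python
from dataclasses import dataclass
from functools import reduce
from statistics import median


@dataclass(frozen=True)
class Partial:
    """Hypothetical per-partition state: enough to rebuild sum/count/min/max/avg."""
    total: float
    count: int
    lo: float
    hi: float

    @property
    def avg(self) -> float:
        return self.total / self.count


def partial_of(values):
    """Computed locally, next to the data (non-empty input assumed)."""
    return Partial(sum(values), len(values), min(values), max(values))


def merge(a, b):
    """Computed at the coordinator: combines two partial states exactly."""
    return Partial(a.total + b.total, a.count + b.count,
                   min(a.lo, b.lo), max(a.hi, b.hi))


# Two partitions of made-up values.
p1 = [10.0, 100.0, 1000.0]
p2 = [20.0, 30.0]

combined = reduce(merge, [partial_of(p1), partial_of(p2)])

# Median has no such small mergeable state: the median of the per-partition
# medians is generally not the median of the whole dataset.
median_of_medians = median([median(p1), median(p2)])
true_median = median(p1 + p2)
```

Only four numbers per partition cross the wire here, regardless of how many
rows each partition holds.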
The provision of a sampled approach seems like another excellent idea to me,
but an orthogonal one. The calculation should probably still be offloaded to
each node, then combined probabilistically. This would also support efficient
multi-partition queries for all aggregations.
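A minimal sketch of what that could look like for SUM, assuming each node
keeps every row independently with probability `p` (Bernoulli sampling) and
the coordinator scales the combined sample sum by `1/p`. The function names
and the sampling scheme are illustrative assumptions, not something proposed
in this ticket.

```python
import random


def node_sample_sum(values, p, rng):
    """Runs on each node: sum of a Bernoulli(p) sample of the local rows."""
    return sum(v for v in values if rng.random() < p)


def estimate_total(node_sums, p):
    """Runs at the coordinator: scale the sampled sums back up by 1/p."""
    return sum(node_sums) / p


rng = random.Random(42)  # fixed seed so the run is repeatable
nodes = [[1.0] * 10_000, [2.0] * 10_000]  # made-up per-node data, true sum 30000
estimate = estimate_total([node_sample_sum(v, 0.1, rng) for v in nodes], 0.1)
```

Each row's expected contribution to the sample is `p` times its value, so
dividing by `p` gives an unbiased estimate of the true sum; with `p = 1.0` the
result is exact.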
I'm not saying any of these are trivial undertakings, but they should be what
we're aiming for AFAICT.
> Aggregation functions in CQL
> ----------------------------
>
> Key: CASSANDRA-4914
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Vijay
> Assignee: Benjamin Lerer
> Labels: cql, docs
> Fix For: 3.0
>
> Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt,
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (Wide row of column
> values of int, double, float etc).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc (for
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |        joe |       doe |   10.1
>    130 |      2 |        joe |       doe |    100
>    130 |      1 |        joe |       doe |  1e+03
>
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>
>  sum(salary) | empid
> -------------+--------
>       1110.1 |    130