[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

Benedict (JIRA) Wed, 18 Feb 2015 07:31:33 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326012#comment-14326012
 ]


Benedict commented on CASSANDRA-4914:
-------------------------------------

I'm with Cristian here, as I suggested at the NGCC last year. If we want 
efficient aggregations, they should absolutely be performed at the replicas. I 
realise we're not aiming for that first time around, but IMO it should be the 
long term goal. Shipping all of your data over the wire is a pretty significant 
cost and bottleneck, making the current implementation more of a convenience 
than an analytic tool.

Conflict resolution can be performed in a few ways. Probably the best is to 
first let the user specify whether they care: CL=ONE is not exactly an 
uncommon use case (last I heard, we reckon 30% of deployments use it), and for 
analytics queries in particular slight staleness may not be important. If they 
do care, perform a repair-aware read from each neighbour to ensure the replica 
is up to date. Or calculate the result optimistically, along with a checksum, 
and perform the repair if either doesn't match. Or select the strategy based 
on whether the data has been updated recently (say, in the last few minutes): 
if it has, be pessimistic; otherwise be optimistic. This is largely what 
[~tjake]'s Repair Aware Consistency Levels (CASSANDRA-7168) is about.
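
To make the optimistic option concrete, here is a rough sketch in Python. It 
is not Cassandra code: replica_aggregate, reconcile and optimistic_sum are 
hypothetical names, rows are assumed to be (clustering_key, value, 
write_timestamp) tuples, and the "repair" is simulated with a last-write-wins 
merge of the full copies purely for illustration.

```python
import hashlib


def replica_aggregate(rows):
    """Hypothetical replica-side step: return (sum, digest) for one copy.

    rows is assumed to be a list of (clustering_key, value, write_timestamp).
    """
    digest = hashlib.sha256()
    total = 0.0
    # Sort so that replicas holding identical data produce identical digests.
    for key, value, ts in sorted(rows):
        digest.update(repr((key, value, ts)).encode())
        total += value
    return total, digest.hexdigest()


def reconcile(copies):
    # Stand-in for a repair: last write wins per clustering key.
    merged = {}
    for rows in copies:
        for key, value, ts in rows:
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return [(key, value, ts) for key, (value, ts) in merged.items()]


def optimistic_sum(copies):
    """Return (result, repaired) where repaired says whether a repair ran."""
    results = [replica_aggregate(rows) for rows in copies]
    if len({digest for _, digest in results}) == 1:
        # All replicas agree, so any one replica's result can be used.
        return results[0][0], False
    # Digests differ: be pessimistic, reconcile, then aggregate again.
    return replica_aggregate(reconcile(copies))[0], True
```

In the agreeing case only a number and a digest per replica would need to 
cross the wire; the full data is touched only when the digests differ.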

Generally, analytics queries are intended to be run over large, _mostly_ 
static datasets, so the computation should be optimised for this IMO. There is 
of course the complication of supporting deterministic aggregations over 
multiple partitions, which would probably have to fall back to 
coordinator-level aggregation for operations that cannot be trivially composed 
exactly (e.g. median), but most aggregations can be composed trivially from 
partial computations.
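
As an illustration of composing from partial computations, here is a minimal 
Python sketch. The Partial type and the helper names are hypothetical, not 
anything in Cassandra, and the split of the rows across two replicas is 
invented; the salary values are the ones from the issue description.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Partial:
    """Hypothetical per-replica partial state; not a Cassandra type."""
    count: int
    total: float
    minimum: float
    maximum: float


def partial_of(values):
    # Computed locally on each replica; assumes values is non-empty.
    return Partial(len(values), sum(values), min(values), max(values))


def merge(a, b):
    # Associative and commutative, so partials can be combined in any order.
    return Partial(a.count + b.count, a.total + b.total,
                   min(a.minimum, b.minimum), max(a.maximum, b.maximum))


def average(p):
    # AVG is not mergeable directly, but it falls out of SUM and COUNT.
    return p.total / p.count


# Salaries from the issue description, split across two made-up replicas.
replica_a = [10.1, 100.0]
replica_b = [1000.0]

merged = merge(partial_of(replica_a), partial_of(replica_b))

# Median does not compose this way: the median of the per-replica medians
# is generally not the median of the full data set.
median_of_medians = median([median(replica_a), median(replica_b)])
true_median = median(replica_a + replica_b)
```

Each replica ships four numbers rather than its rows. The median case shows 
why some operations would still need the values themselves (or a fallback to 
the coordinator).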

The provision of a sampled approach seems like another excellent idea to me, 
but an orthogonal one. The calculation should probably still be offloaded to 
each node, then combined probabilistically. This would also support efficient 
multi-partition queries for all aggregations.
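
One possible reading of "combined probabilistically", sketched in Python under 
my own assumptions: each node keeps every row independently with probability 
`rate`, and the coordinator scales the sampled total by 1/rate to estimate the 
true sum. node_sample and estimate_sum are hypothetical names, and other 
sampling schemes would work just as well.

```python
import random


def node_sample(values, rate, rng):
    """Hypothetical node-side step: sample rows and aggregate the sample."""
    # Keep each row independently with probability `rate`.
    kept = [v for v in values if rng.random() < rate]
    return sum(kept), len(kept)


def estimate_sum(node_results, rate):
    # Scaling the sampled total by 1/rate gives an unbiased estimate of the
    # true sum; with rate == 1.0 every row is kept and the result is exact.
    return sum(total for total, _ in node_results) / rate


rng = random.Random(0)
nodes = [list(range(100)), list(range(100, 200))]

exact = estimate_sum([node_sample(v, 1.0, rng) for v in nodes], 1.0)
approx = estimate_sum([node_sample(v, 0.5, rng) for v in nodes], 0.5)
```

Only a (sum, count) pair per node crosses the wire, and the count gives the 
coordinator what it needs to judge how much to trust the estimate.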

I'm not saying any of these are trivial undertakings, but they should be what 
we're aiming for AFAICT.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (Wide row of column 
> values of int, double, float etc).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc (for 
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
