[
https://issues.apache.org/jira/browse/CASSANDRA-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486692#comment-14486692
]
Jonathan Shook commented on CASSANDRA-8826:
-------------------------------------------
Consider that many systems implement aggregate processing at the client
node. A more efficient design would let those aggregates be processed close
to storage rather than bulk-shipping operands across the wire to the client
before any computation can even begin. Even using the coordinator for this
is relatively wasteful. After considering multiple options for how to handle
aggregates in a Cassandra-idiomatic way, I arrived at pretty much the same
place as [~benedict]. The point is not to try to emulate other systems, but to
highly optimize a very common and traffic-sensitive usage pattern.
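To make the idea concrete, here is a minimal sketch (not Cassandra code; all names are illustrative) of computing AVG by merging small per-replica partial aggregates at the coordinator, instead of shipping every raw row to one node:

```python
# Illustrative sketch of distributed aggregation: each replica reduces its
# raw rows to a small partial result, and the coordinator merges partials.

def replica_partial_avg(rows):
    """Runs close to storage: reduce raw rows to a (sum, count) pair."""
    values = [r["value"] for r in rows]
    return (sum(values), len(values))

def coordinator_merge_avg(partials):
    """Runs on the coordinator: merge the small partial results into AVG."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

# Two replicas each aggregate locally; only two small tuples cross the wire.
replica_a = replica_partial_avg([{"value": 10}, {"value": 20}])
replica_b = replica_partial_avg([{"value": 30}])
print(coordinator_merge_avg([replica_a, replica_b]))  # 20.0
```

The key property is that the merge operates on fixed-size partials regardless of row count, which is what makes pushing the computation toward storage pay off.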
The partial-data scenarios (CL>1) are interesting, but you can easily describe
a reasonable behavior if data were missing from a replica. In the most basic
case, you simply reflect the standard CL interpretation: "the results from
these nodes are not consistent at CL=Q". While this is not helpful to clients
as such, it is a consistent interpretation of the semantics. The same things
you might do as a user to deal with it do not change. If the data of interest
is consistent, then aggregations of that data will be consistent, and vice
versa.
That almost certainly invites more questions about the likely scenario of
partial data for near-time reads at CL>1. That, to me, is the most interesting
and challenging part of this idea. If you simply activate read-repair logic as
an intermediate step, you still maintain the same CL semantics that users
would expect.
Am I missing something that makes this more complicated than I am thinking? My
impression is that the concern for complexity is more fairly placed on the
more advanced things you might build on top of distributed single-partition
aggregates, not on the basic idea itself.
> Distributed aggregates
> ----------------------
>
> Key: CASSANDRA-8826
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8826
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Robert Stupp
> Priority: Minor
>
> Aggregations have been implemented in CASSANDRA-4914.
> All calculation is performed on the coordinator. This means that all data is
> pulled to the coordinator and processed there.
> This ticket is about distributing aggregates to make them more efficient.
> Some related tickets (esp. CASSANDRA-8099) are currently in progress - we
> should wait for them to land before talking about implementation.
> Another playground (not covered by this ticket) that might be related is
> _distributed filtering_.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)