[
https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149000#comment-17149000
]
Caleb Rackliffe commented on CASSANDRA-15907:
---------------------------------------------
At this point, we're sitting on what appears to be 4 distinct approaches to
addressing the problems in the current implementation. Before trying to
contrast them all, I want to think through the kinds of usage we expect and the
consequences of that. Future indexing implementations aside, neither filtering
queries nor secondary index queries are currently meant to be used at scale
(especially at CL > ONE/LOCAL_ONE) without partition restrictions. Optimizing
for that case seems reasonable. The other big axis is how common out-of-sync
replicas actually are, and how well we have to respond across the spectrum from
"rare" to "entire replica datasets are out of sync". What's currently in trunk
does just fine if
there is very little out-of-sync data, especially in the common case that we're
limited to a partition. (i.e. The actual number of protection queries is very
low, because we group by partition.) Its weakness is the edge case where
out-of-sync data is pervasive.
bq. Issue blocking RFP read immediately at {{MergeListener#onMergedRows}} when
detecting potential outdated rows
This single-pass solution would excel in situations where there are very few
silent replicas and put very little stress on the heap, given it could simply
forgo caching merged rows that don't satisfy the query filter. It also appears
to be a fairly simple change to the existing logic. The downside of this
approach is that it would start to issue a pretty high volume of individual row
protection queries as it came across more silent replicas, without even the
mitigating benefit of partition grouping. On the other hand, because nothing
extra is cached, it wouldn't require any new guardrails around memory usage,
and the worst that could happen is a query timeout.
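To make that concrete, here's a minimal, self-contained sketch of the single-pass
idea. Everything in it ({{Row}}, {{RowFetcher}}, {{onMergedRow()}}) is a simplified
stand-in for illustration, not the actual {{MergeListener}} machinery:
{code:java}
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch only: silent replicas are detected per merged row, and a protection read for
// that single row is issued immediately instead of caching the row for a second pass.
final class SinglePassProtectionSketch
{
    static final class Row
    {
        final String key;
        final Map<String, String> cells;
        Row(String key, Map<String, String> cells) { this.key = key; this.cells = cells; }
    }

    // One protection query for a single row against a single (silent) replica.
    interface RowFetcher
    {
        Row fetch(String replica, String key);
    }

    private final RowFetcher fetcher;
    private final Predicate<Row> filter;
    private final List<String> replicas;

    SinglePassProtectionSketch(RowFetcher fetcher, Predicate<Row> filter, List<String> replicas)
    {
        this.fetcher = fetcher;
        this.filter = filter;
        this.replicas = replicas;
    }

    // Called once per merged row; versions[i] == null means replica i was silent for this row.
    Row onMergedRow(Row merged, Row[] versions)
    {
        if (!filter.test(merged))
            return merged; // non-matching rows are never cached, which is the heap-friendly part

        for (int i = 0; i < versions.length; i++)
        {
            if (versions[i] != null)
                continue;

            Row completed = fetcher.fetch(replicas.get(i), merged.key);
            if (completed != null)
                completed.cells.forEach(merged.cells::putIfAbsent); // timestamp reconciliation elided
        }
        return merged;
    }
}
{code}
The cost profile falls out directly: one {{fetch()}} call per (matching row, silent
replica) pair, with no grouping by partition.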
bq. We could try to not cache all the results but advance in blocks of a
certain fixed number of cached results, so we limit the number of cached
results while we can still group keys to do less queries. That is, we could
have that pessimistic SRP read prefetching and caching N rows completed with
extra queries to the silent replicas, plugged to another group of
unmerged-merged counters to prefetch more results if (probably) needed
This seems to retain all the nice characteristics of the current trunk
implementation (most importantly partition grouping for RFP queries), with the
added benefit that it should only use heap proportional to the actual user
limit (although not precisely, given the difference between the batch size and
the limit). It wouldn't really require any new guardrails around memory usage,
given the tighter coupling to the limit or page size, and the worst case is
also a timeout. The stumbling block feels like complexity, but that might just
be my lack of creativity. [~adelapena] Wouldn't we have to avoid SRP in the
first phase of the query to limit the size of the result cache during batches?
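For comparison, a similarly hedged sketch of the block-wise idea, again with stand-in
types; the per-replica key grouping is the part that preserves the "fewer queries"
benefit, and {{batchSize}} stands in for whatever we end up tying to the limit or page
size:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: cache at most batchSize merged rows, then issue one grouped protection
// query per silent replica for the keys accumulated in that block.
final class BatchedProtectionSketch
{
    // One grouped protection query: a set of row keys fetched from a single silent replica.
    interface BatchFetcher
    {
        Map<String, String> fetchRows(String replica, Set<String> keys);
    }

    private final BatchFetcher fetcher;
    private final List<String> replicas;
    private final int batchSize;
    private final List<String> cachedKeys = new ArrayList<>();                // bounded by batchSize
    private final Map<String, Set<String>> silentKeysByReplica = new HashMap<>();

    BatchedProtectionSketch(BatchFetcher fetcher, List<String> replicas, int batchSize)
    {
        this.fetcher = fetcher;
        this.replicas = replicas;
        this.batchSize = batchSize;
    }

    // versionsPresent[i] == false means replica i was silent for this row.
    void onMergedRow(String key, boolean[] versionsPresent)
    {
        cachedKeys.add(key);
        for (int i = 0; i < versionsPresent.length; i++)
            if (!versionsPresent[i])
                silentKeysByReplica.computeIfAbsent(replicas.get(i), r -> new HashSet<>()).add(key);

        if (cachedKeys.size() >= batchSize)
            completeBatch(); // heap stays proportional to batchSize, not to the total stale data
    }

    private void completeBatch()
    {
        // One grouped query per silent replica for the whole block; merging of results elided.
        for (Map.Entry<String, Set<String>> entry : silentKeysByReplica.entrySet())
            fetcher.fetchRows(entry.getKey(), entry.getValue());
        silentKeysByReplica.clear();
        cachedKeys.clear();
    }
}
{code}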
I've been trying to figure out a way to merge these two ideas, i.e. to batch
partition/completion reads in the RFP {{MergeListener}}. Combined w/ filtering,
also in the {{MergeListener}}, we could discard (i.e. avoid caching) the rows
that don't pass the filter. The problem is that the return value of
{{onMergedRows()}} is what presently informs SRP/controls the counter.
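A toy illustration of that coupling, with stand-in types rather than the real
{{MergeListener}} or {{DataLimits}} counter: whatever the listener returns is what the
downstream counting sees, so returning null to avoid caching a non-matching row also
hides it from the counter.
{code:java}
import java.util.function.Predicate;

final class ListenerCounterCoupling
{
    // Toy stand-in for a merge listener; what it returns is what downstream counting sees.
    interface Listener
    {
        String onMergedRow(String merged);
    }

    // Toy stand-in for the short-read-protection counter.
    static final class Counter
    {
        int counted;
        void count(String row)
        {
            if (row != null)
                counted++;
        }
    }

    public static void main(String[] args)
    {
        Predicate<String> filter = row -> row.startsWith("match");
        // Discarding non-matching rows in the listener (to avoid caching them)...
        Listener discarding = merged -> filter.test(merged) ? merged : null;

        Counter counter = new Counter();
        for (String merged : new String[]{ "match-1", "stale-2", "match-3" })
            counter.count(discarding.onMergedRow(merged));

        // ...also hides them from the downstream counter: this prints 2, not 3.
        System.out.println("counter saw " + counter.counted + " of 3 merged rows");
    }
}
{code}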
> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-15907
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Coordination, Feature/2i Index
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Labels: 2i, memory
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i
> and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a
> few things we should follow up on, however, to make life a bit easier for
> operators and generally de-risk usage:
> (Note: Line numbers are based on {{trunk}} as of
> {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be
> able to use simple arrays instead of lists for {{rowsToFetch}} and
> {{originalPartitions}}. Alternatively (or also), we may be able to null out
> references in these two collections more aggressively. (ex. Using
> {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}},
> assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.)
> * {{ReplicaFilteringProtection:323}} - We may be able to use
> {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of
> {{UnaryOperator#identity()}} instead of creating one on the fly.
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather
> rather than serially querying every row that needs to be completed. This
> isn't a clear win, perhaps, given it targets the latency of single queries and
> adds some complexity. (Certainly a decent candidate to kick out of this issue
> entirely.) A sketch of the scatter/gather shape follows this list.
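To make that scatter/gather item concrete, here's a hedged sketch of the shape it
could take using plain {{CompletableFuture}} rather than Cassandra's internal request
machinery ({{queryRowOnReplica}} is a stand-in for whatever reads one key from one
replica):
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

final class ScatterGatherSketch
{
    // Issue all per-row completion queries concurrently, then block once for the whole
    // batch instead of blocking on each row in turn.
    static Map<String, String> completeRows(List<String> keys,
                                            String replica,
                                            BiFunction<String, String, String> queryRowOnReplica)
    {
        List<CompletableFuture<Map.Entry<String, String>>> futures =
            keys.stream()
                .map(key -> CompletableFuture.supplyAsync(
                         () -> Map.entry(key, queryRowOnReplica.apply(replica, key))))
                .collect(Collectors.toList());

        // Gather: a single wait for all of the scattered reads.
        return futures.stream()
                      .map(CompletableFuture::join)
                      .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
{code}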
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in
> {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side
> filtering protection" (which makes it seem like the coordinator doesn't
> filter) rather than "replica filtering protection" (which sounds more like
> what we actually do, which is protect ourselves against incorrect replica
> filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put
> the {{repairedDataTracker}} in {{ResolveContext}}.
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory
> usage of {{ReplicaFilteringProtection}}, which caches row responses from the
> first round of requests. (Remember, these are later merged with the
> second round of results to complete the data for filtering.) Operators will
> likely need a way to protect themselves, i.e. simply fail queries if they hit
> a particular threshold rather than GC nodes into oblivion. (Having control
> over limits and page sizes doesn't quite get us there, because stale results
> _expand_ the number of incomplete results we must cache.) The fun question is
> how we do this, with the primary axes being scope (per-query, global, etc.)
> and granularity (per-partition, per-row, per-cell, actual heap usage, etc.).
> My starting disposition on the right trade-off between
> performance/complexity and accuracy is having something along the lines of
> cached rows per query. Prior art suggests this probably makes sense alongside
> things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. (A rough
> sketch of a per-query check follows.)
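For illustration, the per-query, per-row flavor of that guardrail could be as simple
as the following; the threshold and its wiring are hypothetical, not an existing
{{cassandra.yaml}} setting:
{code:java}
// Sketch only: fail the query once replica filtering protection has cached more rows
// than an operator-configured threshold, rather than letting the heap grow unbounded.
final class CachedRowsGuardrailSketch
{
    private final int failureThreshold; // hypothetical, in the spirit of tombstone_failure_threshold
    private int cachedRows;

    CachedRowsGuardrailSketch(int failureThreshold)
    {
        this.failureThreshold = failureThreshold;
    }

    // Called each time replica filtering protection caches another row for this query.
    void onRowCached()
    {
        cachedRows++;
        if (cachedRows > failureThreshold)
            throw new IllegalStateException("Replica filtering protection cached " + cachedRows +
                                            " rows; failing the query instead of risking the heap");
    }
}
{code}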