[
https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161154#comment-17161154
]
Andres de la Peña commented on CASSANDRA-15907:
-----------------------------------------------
[~maedhroz] Having a metric for the cache size is really nice. However, I think
it would be more useful to track the max per-query cache size, rather than the
per-partition size. This is because a single advance of the first-phase merge
iterator can still insert an unbounded number of partitions into the cache. For
example, consider this scenario, in which all the rows from an outdated
replica are superseded by the up-to-date replica:
{code:python}
self._prepare_cluster(
    create_table="CREATE TABLE t (k int, c int, v text, PRIMARY KEY(k, c))",
    create_index="CREATE INDEX ON t(v)",
    both_nodes=["INSERT INTO t (k, c, v) VALUES (0, 0, 'old')",
                "INSERT INTO t (k, c, v) VALUES (0, 1, 'old')",
                "INSERT INTO t (k, c, v) VALUES (1, 0, 'old')",
                "INSERT INTO t (k, c, v) VALUES (1, 1, 'old')",
                "INSERT INTO t (k, c, v) VALUES (2, 0, 'old')",
                "INSERT INTO t (k, c, v) VALUES (2, 1, 'old')",
                "INSERT INTO t (k, c, v) VALUES (3, 0, 'old')",
                "INSERT INTO t (k, c, v) VALUES (3, 1, 'old')"],
    only_node1=["DELETE FROM t WHERE k = 0",
                "DELETE FROM t WHERE k = 1",
                "DELETE FROM t WHERE k = 2",
                "DELETE FROM t WHERE k = 3"])
self._assert_none("SELECT c FROM t WHERE v = 'old' LIMIT 1")
{code}
In this test all 4 partitions are cached with a single advance of the merge
iterator, so the cache contains 16 rows at its maximum. However, the metric
records a per-partition cache size of 2 eight times. I think it would be more
useful to have either a single metric with that max cache size of 16, or two
records (one per replica) with a value of 8 (see the sketch below). This would
make it easier to detect cases like the one exposed in the example.
Also, with a per-query metric it would be easier to advise operators to start
worrying about the consistency among replicas when the RFP metric gets higher
than the fetch size, independently of whether the queries are single-partition
or not.
As for tracking the cache size per replica or per query, I think it would be
nice to use the same criterion as the guardrail, so that both measure the same
thing. That would mean either tracking the metric per query and leaving the
guardrail as it is, or changing the guardrail to be per-replica instead of
per-query.
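For illustration, here is a minimal sketch of the two accounting scopes under
discussion (hypothetical names, not actual Cassandra code). In the example
above, the per-replica view would yield two values of 8, while the per-query
view would yield a single value of 16; whichever scope we pick, the metric and
the guardrail could share it.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class CacheAccounting
{
    private final Map<String, Integer> rowsPerReplica = new HashMap<>();

    // called each time a row from a replica's first-phase response is cached
    void onRowCached(String replica)
    {
        rowsPerReplica.merge(replica, 1, Integer::sum);
    }

    // per-replica scope: the largest contribution from any single replica (8 in the example)
    int maxPerReplica()
    {
        return rowsPerReplica.values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    // per-query scope: everything cached on behalf of the query (16 in the example)
    int totalPerQuery()
    {
        return rowsPerReplica.values().stream().mapToInt(Integer::intValue).sum();
    }
}
{code}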
WDYT?
> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-15907
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Coordination, Feature/2i Index
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Labels: 2i, memory
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i
> and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a
> few things we should follow up on, however, to make life a bit easier for
> operators and generally de-risk usage:
> (Note: Line numbers are based on {{trunk}} as of
> {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be
> able to use simple arrays instead of lists for {{rowsToFetch}} and
> {{originalPartitions}}. Alternatively (or also), we may be able to null out
> references in these two collections more aggressively. (ex. Using
> {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}},
> assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}; see
> the sketch after this list.)
> * {{ReplicaFilteringProtection:323}} - We may be able to use
> {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of
> {{UnaryOperator#identity()}} instead of creating one on the fly.
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather
> rather than serially querying every row that needs to be completed. This
> isn't a clear win perhaps, given it targets the latency of single queries and
> adds some complexity. (Certainly a decent candidate to kick out of this
> issue entirely.)
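> For the first bullet, a minimal sketch of the nulling-out idea (hypothetical
> names, not the actual {{ReplicaFilteringProtection}} code): consuming each
> cached element via {{ArrayList#set()}} returns it and frees the slot, so a
> merged partition becomes garbage-collectable before the query finishes.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public class ConsumeOnRead
> {
>     // Replace get(i) with set(i, null): the previous element is returned to
>     // the caller, and the slot no longer pins it, so it can be GC'd once merged.
>     static <T> T consume(List<T> cache, int i)
>     {
>         return cache.set(i, null);
>     }
>
>     public static void main(String[] args)
>     {
>         List<String> originalPartitions = new ArrayList<>(List.of("p0", "p1"));
>         System.out.println(consume(originalPartitions, 0)); // p0
>         System.out.println(originalPartitions.get(0));      // null, p0 now collectable
>     }
> }
> {code}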
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in
> {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side
> filtering protection" (which makes it seem like the coordinator doesn't
> filter) rather than "replica filtering protection" (which sounds more like
> what we actually do, which is protect ourselves against incorrect replica
> filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put
> the {{repairedDataTracker}} in {{ResolveContext}}.
> *Testing*
> * I want to bite the bullet and get some basic tests for RFP (including any
> guardrails we might add here) onto the in-JVM dtest framework.
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory
> usage of {{ReplicaFilteringProtection}} which caches row responses from the
> first round of requests. (Remember, these are later merged with the
> second round of results to complete the data for filtering.) Operators will
> likely need a way to protect themselves, i.e. simply fail queries if they hit
> a particular threshold rather than GC nodes into oblivion. (Having control
> over limits and page sizes doesn't quite get us there, because stale results
> _expand_ the number of incomplete results we must cache.) The fun question is
> how we do this, with the primary axes being scope (per-query, global, etc.)
> and granularity (per-partition, per-row, per-cell, actual heap usage, etc.).
> My starting disposition on the right trade-off between
> performance/complexity and accuracy is having something along the lines of
> cached rows per query. Prior art suggests this probably makes sense alongside
> things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}.
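> A minimal sketch of what a cached-rows-per-query guardrail could look like
> (all names and the threshold are hypothetical, not a committed design): count
> rows as RFP caches them and fail the query once a configured limit is crossed,
> in the spirit of {{tombstone_failure_threshold}}.
> {code:java}
> public class CachedRowsGuardrail
> {
>     // hypothetical knob, analogous to tombstone_failure_threshold in cassandra.yaml
>     static final int CACHED_ROWS_FAILURE_THRESHOLD = 32_000;
>
>     private int cachedRows;
>
>     // invoked each time replica filtering protection caches another row
>     void onRowCached()
>     {
>         if (++cachedRows > CACHED_ROWS_FAILURE_THRESHOLD)
>             throw new IllegalStateException("Replica filtering protection cached " +
>                                             cachedRows + " rows; failing the query " +
>                                             "rather than risking a GC storm");
>     }
> }
> {code}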