[ 
https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157366#comment-17157366
 ] 

Andres de la Peña commented on CASSANDRA-15907:
-----------------------------------------------

I have left some changes to {{ReplicaFilteringProtection}} 
[here|https://github.com/adelapena/cassandra/commit/afb3aafbca48d9ec7b5810d8dee2b1b6d6e4dc50] 
that reduce the number of cached rows in some cases.

The caching happens when we fully consume the pessimistically SRP-protected 
results of the first phase. This is done to collect the primary keys that have 
to be queried against the silent replicas. The suggested change avoids that 
full consumption of the first-phase results. Instead, it consumes only one 
partition each time the second-phase iterators (those returned by 
{{queryProtectedPartitions}}) don't have a cached partition to offer. At that 
moment the first-phase iterator is advanced one partition, so it caches and 
fetches all the partitions between the last SRP-protected partition and the 
next one. Note that consuming one more merged partition might involve reading 
an undefined number of partitions from the replica responses if they invalidate 
each other, so this doesn't eliminate the caching problem; it only alleviates 
it in some cases.
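
To make the intended flow concrete, here is a minimal Python sketch of the 
lazy consumption idea (just an illustrative model with hypothetical names, not 
the actual Java code in {{ReplicaFilteringProtection}}):
{code:python}
from collections import deque

class LazyProtectedResults:
    """
    Illustrative model: instead of consuming the whole first-phase result up
    front, partitions are cached one at a time, only when the second phase
    runs out of cached partitions to offer.
    """

    def __init__(self, first_phase_partitions):
        # first_phase_partitions: iterable of (partition_key, rows) pairs
        # coming from the merged first-round replica responses
        self._first_phase = iter(first_phase_partitions)
        self._cached = deque()

    def _advance_first_phase(self):
        # Consume exactly one merged partition: cache its rows and note which
        # of them are incomplete and must be completed from silent replicas.
        key, rows = next(self._first_phase)  # StopIteration ends the query
        to_fetch = [r for r in rows if r.get('incomplete')]
        self._cached.append((key, rows, to_fetch))

    def next_protected_partition(self):
        # Second phase: only when there is no cached partition available do we
        # advance the first-phase iterator by one more partition.
        if not self._cached:
            self._advance_first_phase()
        key, rows, to_fetch = self._cached.popleft()
        # ...here we would query the silent replicas for `to_fetch` and merge
        # the responses into `rows` before returning the completed partition...
        return key, rows
{code}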

Caching is still done at the partition level, so the replicas can still be 
asked with a single query per partition. Thus, this change would be beneficial 
mainly for multi-partition queries; for example, the query in the following 
test goes down from 20 cached rows to just 4:
{code:python}
self._prepare_cluster(
    create_table="CREATE TABLE t (k int PRIMARY KEY, v text)",
    create_index="CREATE INDEX ON t(v)",
    both_nodes=["INSERT INTO t (k, v) VALUES (5, 'old')",
                "INSERT INTO t (k, v) VALUES (1, 'old')",
                "INSERT INTO t (k, v) VALUES (8, 'old')",
                "INSERT INTO t (k, v) VALUES (0, 'old')",
                "INSERT INTO t (k, v) VALUES (2, 'old')",
                "INSERT INTO t (k, v) VALUES (4, 'old')",
                "INSERT INTO t (k, v) VALUES (7, 'old')",
                "INSERT INTO t (k, v) VALUES (6, 'old')",
                "INSERT INTO t (k, v) VALUES (9, 'old')",
                "INSERT INTO t (k, v) VALUES (3, 'old')"],
    only_node1=["INSERT INTO t (k, v) VALUES (5, 'new')",
                "INSERT INTO t (k, v) VALUES (3, 'new')"])
self._assert_all("SELECT * FROM t WHERE v = 'old'",
                 rows=[[1, 'old'], [8, 'old'], [0, 'old'], [2, 'old'],
                       [4, 'old'], [7, 'old'], [6, 'old'], [9, 'old']])
{code}
The number of both SRP and RFP requests remains the same in most cases, 
although in some particular cases the proposed approach can save us a few 
queries.

We could also do something similar at the row level: instead of advancing 
partition by partition, we would advance row by row until we have a 
significant amount of cached data and/or primary keys to fetch. However, even 
in that case it would still be possible to design (unlikely) scenarios where 
advancing the pessimistic SRP iterator a single position would require reading 
(and caching) the entire database. Thus, we would still need the guardrail, 
and I'm not sure advancing row by row is worth the additional complexity, 
given that the problematic cases are supposed to be unlikely. Perhaps we could 
give it a go after 4.0, as an improvement.
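
For reference, a row-level variant could look roughly like the following 
sketch (again only an illustrative model; the names and thresholds are made up 
for the example, not a concrete proposal):
{code:python}
def advance_by_rows(first_phase_rows, cache, keys_to_fetch,
                    max_cached_rows=100, max_keys_to_fetch=10):
    """
    Hypothetical row-level advancement: pull rows one by one from the
    pessimistic first-phase iterator until we have accumulated a significant
    amount of cached rows and/or primary keys that need fetching.
    """
    for row in first_phase_rows:
        cache.append(row)
        if row.get('incomplete'):
            keys_to_fetch.append(row['key'])
        if len(cache) >= max_cached_rows or len(keys_to_fetch) >= max_keys_to_fetch:
            break  # enough material to issue the next batch of replica queries
{code}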

I haven't updated the guardrail tests; they would need minor changes for the 
proposed approach because some queries cache fewer rows than before.

[~maedhroz] WDYT?


> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-15907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>              Labels: 2i, memory
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i 
> and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a 
> few things we should follow up on, however, to make life a bit easier for 
> operators and generally de-risk usage:
> (Note: Line numbers are based on {{trunk}} as of 
> {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be 
> able to use simple arrays instead of lists for {{rowsToFetch}} and 
> {{originalPartitions}}. Alternatively (or also), we may be able to null out 
> references in these two collections more aggressively. (ex. Using 
> {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, 
> assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.)
> * {{ReplicaFilteringProtection:323}} - We may be able to use 
> {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of 
> {{UnaryOperator#identity()}} instead of creating one on the fly.
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather 
> rather than serially querying every row that needs to be completed. This 
> isn't a clear win perhaps, given it targets the latency of single queries and 
> adds some complexity. (Certainly a decent candidate to kick even out of this 
> issue.)
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in 
> {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side 
> filtering protection" (which makes it seem like the coordinator doesn't 
> filter) rather than "replica filtering protection" (which sounds more like 
> what we actually do, which is protect ourselves against incorrect replica 
> filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put 
> the {{repairedDataTracker}} in {{ResolveContext}}.
> *Testing*
> * I want to bite the bullet and get some basic tests for RFP (including any 
> guardrails we might add here) onto the in-JVM dtest framework.
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory 
> usage of {{ReplicaFilteringProtection}}, which caches row responses from the 
> first round of requests. (Remember, these are later merged with the 
> second round of results to complete the data for filtering.) Operators will 
> likely need a way to protect themselves, i.e. simply fail queries if they hit 
> a particular threshold rather than GC nodes into oblivion. (Having control 
> over limits and page sizes doesn't quite get us there, because stale results 
> _expand_ the number of incomplete results we must cache.) The fun question is 
> how we do this, with the primary axes being scope (per-query, global, etc.) 
> and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). 
> My starting disposition on the right trade-off between 
> performance/complexity and accuracy is having something along the lines of 
> cached rows per query. Prior art suggests this probably makes sense alongside 
> things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}.


