[ 
https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149335#comment-17149335
 ] 

Andres de la Peña commented on CASSANDRA-15907:
-----------------------------------------------

{quote}neither filtering queries nor secondary index queries are currently 
meant to be used at scale (especially at CL > ONE/LOCAL_ONE) without partition 
restrictions. Optimizing for that case seems reasonable. The other big axis is 
how common out of sync replicas actually are, and how responsive we have to be 
from "rare" to "entire replica datasets are out of sync". What's currently in 
trunk does just fine if there is very little out-of-sync data, especially in 
the common case that we're limited to a partition. (i.e. The actual number of 
protection queries is very low, because we group by partition.) Its weakness is 
the edge case.
{quote}
Agreed, the current approach is probably adequate for most of the cases 
described in 8272/8273, which are already kind of edge cases. The cases with 
lots of out-of-sync data should be even more uncommon, and would probably 
happen when something is already going badly in the cluster. So perhaps the 
guardrail limiting the number of cached keys is enough to give us some 
protection in those edge cases, and we can move to a more sophisticated 
approach later, if those cases prove not to be so uncommon, or as part of a 
second round of improvements. The downside of the guardrail is that if we 
later decide to move to a different SRP approach, the guardrail config 
property will become deprecated, possibly after a very short life.

{quote}
The stumbling block feels like complexity, but that might just be my lack of 
creativity. [~adelapena] Wouldn't we have to avoid SRP in the first phase of 
the query to limit the size of the result cache during batches?
{quote}
I don't think I have a clear idea of exactly how to do this; it's just a 
rough idea and I could be missing something. The idea is to use a kind of 
internal pagination in the first phase, which would still need SRP for the 
same reason it's needed with the current approach. It would indeed be a more 
complex approach but, if it's not a delusion and it actually works, it would 
give us the best of both worlds: limited memory usage and query grouping. 
But, as I said, the cases creating excessive memory pressure might be so rare 
that we'd be fine with the just-in-case guardrail.
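To make that a bit more concrete, here's a toy sketch (all names are 
hypothetical, nothing here is actual Cassandra code) of how the internal 
pagination could work: cache and protection-query one bounded batch of 
partitions at a time, so memory is limited to the batch while the protection 
queries are still grouped:

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PaginatedFirstPhaseSketch
{
    static final int BATCH_SIZE = 128; // hypothetical bound on cached partitions

    // Drain the first phase in fixed-size batches: cache a batch, issue the
    // grouped protection (completion) queries for just that batch, emit the
    // merged results, then drop the references before the next batch.
    static <P> void run(Iterator<P> firstPhase)
    {
        while (firstPhase.hasNext())
        {
            List<P> batch = new ArrayList<>(BATCH_SIZE);
            while (batch.size() < BATCH_SIZE && firstPhase.hasNext())
                batch.add(firstPhase.next()); // still subject to SRP, as today
            completeWithProtectionQueries(batch); // grouped second-phase fetch
            emit(batch);
            batch.clear(); // memory is bounded by one batch at a time
        }
    }

    static <P> void completeWithProtectionQueries(List<P> batch) { /* ... */ }
    static <P> void emit(List<P> batch) { /* ... */ }
}
{code}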

{quote}
Combined w/ filtering, also in the {{MergeListener}}, we could discard (i.e. 
avoid caching) the rows that don't pass the filter. The problem is that the 
return value of {{onMergedRows()}} is what presently informs SRP/controls the 
counter.
{quote}
I think all the rows collected during the current first phase pass the filter. 
Regarding fetching the entire partition in the merge listener with a 
per-partition single-phase approach, that sounds like the last implementation 
attempt we made before moving to the current two-phase approach, which we 
abandoned because of how it messed with SRP, although I might be missing 
something here.




> Operational Improvements & Hardening for Replica Filtering Protection
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-15907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15907
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>              Labels: 2i, memory
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i 
> and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a 
> few things we should follow up on, however, to make life a bit easier for 
> operators and generally de-risk usage:
> (Note: Line numbers are based on {{trunk}} as of 
> {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.)
> *Minor Optimizations*
> * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be 
> able to use simple arrays instead of lists for {{rowsToFetch}} and 
> {{originalPartitions}}. Alternatively (or also), we may be able to null out 
> references in these two collections more aggressively. (e.g. Using 
> {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, 
> assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}. See 
> the sketch after this list.)
> * {{ReplicaFilteringProtection:323}} - We may be able to use 
> {{EncodingStats.merge()}} and remove the custom {{stats()}} method.
> * {{DataResolver:111 & 228}} - Cache an instance of 
> {{UnaryOperator#identity()}} instead of creating one on the fly (also 
> sketched after this list).
> * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather 
> rather than serially querying every row that needs to be completed. This 
> isn't a clear win perhaps, given it targets the latency of single queries and 
> adds some complexity. (Certainly a decent candidate to kick even out of this 
> issue.)
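> As a concrete illustration, here's a minimal, self-contained sketch of the 
> first and third items above (generic types and hypothetical names, not the 
> actual Cassandra classes):
> {code:java}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
> import java.util.function.UnaryOperator;
>
> public class MinorOptimizationsSketch
> {
>     // Cache a single identity operator instead of allocating one per call.
>     private static final UnaryOperator<String> IDENTITY = UnaryOperator.identity();
>
>     // set(i, null) returns the old element and releases the slot for GC,
>     // unlike get(i), which leaves the list holding a strong reference.
>     static String consume(List<String> cached, int i)
>     {
>         String element = cached.set(i, null); // fetch and null out in one step
>         return IDENTITY.apply(element);
>     }
>
>     public static void main(String[] args)
>     {
>         List<String> cached = new ArrayList<>(Arrays.asList("a", "b", "c"));
>         System.out.println(consume(cached, 1)); // prints "b"; slot 1 is now free
>     }
> }
> {code}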
> *Documentation and Intelligibility*
> * There are a few places (CHANGES.txt, tracing output in 
> {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side 
> filtering protection" (which makes it seem like the coordinator doesn't 
> filter) rather than "replica filtering protection" (which sounds more like 
> what we actually do, which is protect ourselves against incorrect replica 
> filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put 
> the {{repairedDataTracker}} in {{ResolveContext}}.
> *Testing*
> * I want to bite the bullet and get some basic tests for RFP (including any 
> guardrails we might add here) onto the in-JVM dtest framework.
> *Guardrails*
> * As it stands, we don't have a way to enforce an upper bound on the memory 
> usage of {{ReplicaFilteringProtection}}, which caches row responses from the 
> first round of requests. (Remember, these are later merged with the 
> second round of results to complete the data for filtering.) Operators will 
> likely need a way to protect themselves, i.e. simply fail queries if they hit 
> a particular threshold rather than GC nodes into oblivion. (Having control 
> over limits and page sizes doesn't quite get us there, because stale results 
> _expand_ the number of incomplete results we must cache.) The fun question is 
> how we do this, with the primary axes being scope (per-query, global, etc.) 
> and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). 
> My starting disposition on the right trade-off between 
> performance/complexity and accuracy is having something along the lines of 
> cached rows per query. Prior art suggests this probably makes sense alongside 
> things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}.
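> For illustration only, a rough sketch of what a per-query cached-row 
> guardrail could look like (all names hypothetical; the actual threshold, 
> config property, and failure exception are exactly what's up for 
> discussion):
> {code:java}
> // Hypothetical guardrail: fail the query once RFP has cached more rows
> // than a configured threshold, instead of GCing the node into oblivion.
> public class CachedRowsGuardrail
> {
>     private final int failureThreshold; // would come from cassandra.yaml
>     private int cachedRows;
>
>     public CachedRowsGuardrail(int failureThreshold)
>     {
>         this.failureThreshold = failureThreshold;
>     }
>
>     // Invoked each time a first-phase row response is cached for merging.
>     void onRowCached()
>     {
>         if (++cachedRows > failureThreshold)
>             throw new IllegalStateException(
>                 "Replica filtering protection cached " + cachedRows +
>                 " rows, over the threshold of " + failureThreshold);
>     }
> }
> {code}
> In {{cassandra.yaml}} such a threshold might sit next to 
> {{tombstone_failure_threshold}}, e.g. as a (made-up) 
> {{replica_filtering_protection_fail_threshold}} property.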


