[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985099#comment-16985099
 ] 

Alex Petrov commented on CASSANDRA-15350:
-----------------------------------------

 [~yifanc] thank you for the patch. I agree with [~spod] notes on the naming. 
Maybe other potential name could be something like 
{{CASUnknownResultException}}?

Regarding whether you'd like to retry or not. Since contention itself is a 
result of concurrency, backing off and not retrying may actually a good idea. 
In case of uncertainty, I'd say you'd usually retry with another write rather 
than reading and then writing, at least that's how I would imagine application 
logic in such case. In case of a failed subsequent write, you'll also get a 
reason in form of a new value. Similar naming should apply to {{uncertainties}} 
in metrics, etc.

Comments: 
  * In {{ErrorMessage#decode}}, you're unconditionally changing the protocol 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR126].
 If I understand correctly, we can still have a combination of {{WriteTimeout}} 
and {{CAS}} in {{v4}}. So if we receive a message from {{v4}} on a {{v5}} node, 
we won't be able to interpret it correctly. You can try and write in-jvm test 
to test for that, or use in-jvm test facilities to try and encode message on 
one version of the node and decode it on the other. 
  * {{ErrorMessage#encode}} we seem to be aided by the fact that 
{{getBackwardsCompatibleException}} gives us the right exception. However, 
since these two parts of code are disjoint, and there's no guarantee future 
changes will be careful, I'd add an assertion that [this code 
path|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR242]
 should be possible only under {{v5}}.
  * Same logic as above applies for size calculation 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR307]
 and 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR325].
 
  * {{CAS_UNCERTAINTY}} clause is new in protocol {{v5}}. We should not attempt 
to interpret or write it for the older protocol versions. In fact, for older 
protocol version we should still encode it as write timeout to be fully 
compatible.
  * I'd argue that "uncertainties" should not be counted towards timeouts like 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-71f06c193f5b5e270cf8ac695164f43aR295].
 But someone with better operational insight might be a better person to answer 
this.
  * Unless you're submitting patches to 2.2, 3.0, and 3.11, let's roll back 
changes to {{IMessageFilters}}, since their API has to be binary compatible 
with older versions.
  * Should we add timeout tests for responses as well as requests in 
{{CasWriteTest}}?
  * Is it possible to try and simplify {{testCasWriteTimeoutDueToContention}}, 
can we achieve contention with {{N}} threads?
  * Similar comment for {{testCasWriteUncertainty_ProposalNotReplayed}}: both 
tests peer quite a lot into implementation internals. If we can't do this, I'd 
just leave the tests out from commit since it might be difficult to maintain 
them. I've also been able to catch a failure of 
{{testCasWriteUncertainty_ProposalNotReplayed}} after several thousand runs. 
Maybe we can use fuzzing to reproduce both scenarios?
  * Can we add a cross-version ser-de test, too?

Some nits: 
  * In {{ErrorMessage#decode}}, there are extra brackets around 
{{WRITE_TIMEOUT}} clause. You can remove those and fix indentation. Same 
happens in {{CAS_UNCERTAINTY}} case. 
  * If we add comments for {{activate}} and {{deactivate}} for off/on, maybe 
it's worth to call those off/on?
  * Maybe it's worth to rename {{Filter}} to {{Sink}} and separate filters from 
interceptors, allowing a common class "sink" between those?
  * We don't really need to rename {{equals}} to {{equals0}} 


> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15350
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/Lightweight Transactions
>            Reporter: Alex Petrov
>            Assignee: Yifan Cai
>            Priority: Normal
>              Labels: protocolv5, pull-request-available
>         Attachments: Utf8StringEncodeBench.java
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to