[ https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985099#comment-16985099 ]
Alex Petrov commented on CASSANDRA-15350: ----------------------------------------- [~yifanc] thank you for the patch. I agree with [~spod] notes on the naming. Maybe other potential name could be something like {{CASUnknownResultException}}? Regarding whether you'd like to retry or not. Since contention itself is a result of concurrency, backing off and not retrying may actually a good idea. In case of uncertainty, I'd say you'd usually retry with another write rather than reading and then writing, at least that's how I would imagine application logic in such case. In case of a failed subsequent write, you'll also get a reason in form of a new value. Similar naming should apply to {{uncertainties}} in metrics, etc. Comments: * In {{ErrorMessage#decode}}, you're unconditionally changing the protocol [here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR126]. If I understand correctly, we can still have a combination of {{WriteTimeout}} and {{CAS}} in {{v4}}. So if we receive a message from {{v4}} on a {{v5}} node, we won't be able to interpret it correctly. You can try and write in-jvm test to test for that, or use in-jvm test facilities to try and encode message on one version of the node and decode it on the other. * {{ErrorMessage#encode}} we seem to be aided by the fact that {{getBackwardsCompatibleException}} gives us the right exception. However, since these two parts of code are disjoint, and there's no guarantee future changes will be careful, I'd add an assertion that [this code path|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR242] should be possible only under {{v5}}. * Same logic as above applies for size calculation [here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR307] and [here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR325]. * {{CAS_UNCERTAINTY}} clause is new in protocol {{v5}}. We should not attempt to interpret or write it for the older protocol versions. In fact, for older protocol version we should still encode it as write timeout to be fully compatible. * I'd argue that "uncertainties" should not be counted towards timeouts like [here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-71f06c193f5b5e270cf8ac695164f43aR295]. But someone with better operational insight might be a better person to answer this. * Unless you're submitting patches to 2.2, 3.0, and 3.11, let's roll back changes to {{IMessageFilters}}, since their API has to be binary compatible with older versions. * Should we add timeout tests for responses as well as requests in {{CasWriteTest}}? * Is it possible to try and simplify {{testCasWriteTimeoutDueToContention}}, can we achieve contention with {{N}} threads? * Similar comment for {{testCasWriteUncertainty_ProposalNotReplayed}}: both tests peer quite a lot into implementation internals. If we can't do this, I'd just leave the tests out from commit since it might be difficult to maintain them. I've also been able to catch a failure of {{testCasWriteUncertainty_ProposalNotReplayed}} after several thousand runs. Maybe we can use fuzzing to reproduce both scenarios? * Can we add a cross-version ser-de test, too? Some nits: * In {{ErrorMessage#decode}}, there are extra brackets around {{WRITE_TIMEOUT}} clause. You can remove those and fix indentation. Same happens in {{CAS_UNCERTAINTY}} case. * If we add comments for {{activate}} and {{deactivate}} for off/on, maybe it's worth to call those off/on? * Maybe it's worth to rename {{Filter}} to {{Sink}} and separate filters from interceptors, allowing a common class "sink" between those? * We don't really need to rename {{equals}} to {{equals0}} > Add CAS “uncertainty” and “contention" messages that are currently propagated > as a WriteTimeoutException. > --------------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-15350 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15350 > Project: Cassandra > Issue Type: Improvement > Components: Feature/Lightweight Transactions > Reporter: Alex Petrov > Assignee: Yifan Cai > Priority: Normal > Labels: protocolv5, pull-request-available > Attachments: Utf8StringEncodeBench.java > > Time Spent: 20m > Remaining Estimate: 0h > > Right now, CAS uncertainty introduced in > https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as > WriteTimeout. One of this conditions it manifests is when there’s at least > one acceptor that has accepted the value, which means that this value _may_ > still get accepted during the later round, despite the proposer failure. > Similar problem happens with CAS contention, which is also indistinguishable > from the “regular” timeout, even though it is visible in metrics correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org