[jira] [Commented] (CASSANDRA-20059) TCM's Retry.Deadline#retryIndefinitely is dangerous if used with RemoteProcessor as the deadline does not impact message retries

David Capwell (Jira) Thu, 07 Nov 2024 09:46:05 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896424#comment-17896424
 ]


David Capwell commented on CASSANDRA-20059:
-------------------------------------------

I think this happened because someone thought "oh, we really should run until 
we get the result", saw the method name, and used it because we want to retry 
forever... but retryIndefinitely on a Deadline doesn't really make sense to 
me, as it's contradictory: I want to stop after some point, but ignore that 
and run forever?

The trunk commit logic calls 
"org.apache.cassandra.tcm.Processor#commit(org.apache.cassandra.tcm.log.Entry.Id,
 org.apache.cassandra.tcm.Transformation, org.apache.cassandra.tcm.Epoch, 
org.apache.cassandra.tcm.Retry.Deadline)", which takes a Deadline.  Walking 
the usages, most of the time we only need a Retry (where retryIndefinitely 
would make sense), but then you hit some uncommon cases that check the 
deadline, such as 
"org.apache.cassandra.tcm.PaxosBackedProcessor#fetchLogAndWait", which just does

{code}
long nextTimeout = Math.min(retryPolicy.deadlineNanos,
                            Clock.Global.nanoTime() + DatabaseDescriptor.getRpcTimeout(TimeUnit.NANOSECONDS));
{code}
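
As a rough illustration of that clamp, here is a minimal, self-contained Java sketch. The class, method, and parameter names are hypothetical, and it takes the current time and RPC timeout as plain arguments instead of calling Clock.Global or DatabaseDescriptor. It only shows that each wait is bounded by whichever comes first: the overall deadline, or one RPC timeout from now.

```java
final class TimeoutClampSketch
{
    /**
     * Returns the absolute time (in nanoTime units) at which the next wait should give up:
     * the overall deadline or "now + per-request timeout", whichever is earlier.
     * All three inputs are hypothetical stand-ins for the values used in the snippet above.
     */
    static long nextTimeoutNanos(long deadlineNanos, long nowNanos, long rpcTimeoutNanos)
    {
        return Math.min(deadlineNanos, nowNanos + rpcTimeoutNanos);
    }
}
```

Far from the deadline the RPC timeout is the limit; close to the deadline the deadline itself is.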

The only other place that checks this field is ProgressBarrier:

{code}
logger.warn("Could not collect {} of nodes for a progress barrier for epoch {} to finish within {}ms. Nodes that have not responded: {}. {}",
            cl, waitFor, TimeUnit.NANOSECONDS.toMillis(deadline.deadlineNanos - start), remaining, deadline);
{code}

The method form of that field is "remainingNanos", which is really only used 
in 
"org.apache.cassandra.tcm.RemoteProcessor#fetchLogAndWait(org.apache.cassandra.tcm.Epoch,
 org.apache.cassandra.tcm.Retry.Deadline)"

That method doesn't retry anything; it just has a duration after which we 
wish to time out. So the field itself is only really used by the Paxos 
version of "fetchLogAndWait".

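To make the concern in the issue description quoted below concrete, here is a minimal Java sketch. Everything in it is a hypothetical stand-in and not the real Retry.Deadline: the clock is injected, and the base class's deadline check in reachedMax is an assumption about what a deadline-aware policy could do. The override mirrors the shape of the quoted retryIndefinitely, where the deadline is stored but never consulted.

```java
import java.util.function.LongSupplier;

final class DeadlineRetrySketch
{
    /** Hypothetical stand-in for a retry policy that carries an absolute deadline. */
    static class Deadline
    {
        final long deadlineNanos;
        final LongSupplier clock; // injected so the example does not depend on wall time

        Deadline(long deadlineNanos, LongSupplier clock)
        {
            this.deadlineNanos = deadlineNanos;
            this.clock = clock;
        }

        /** Assumption: a deadline-aware policy refuses further retries once the deadline has passed. */
        boolean reachedMax()
        {
            return clock.getAsLong() - deadlineNanos >= 0;
        }
    }

    /** Mirrors the shape of the quoted retryIndefinitely: reachedMax ignores the deadline. */
    static Deadline retryIndefinitely(long timeoutNanos, LongSupplier clock)
    {
        return new Deadline(clock.getAsLong() + timeoutNanos, clock)
        {
            @Override
            boolean reachedMax()
            {
                return false;
            }
        };
    }

    /**
     * Simulates a sender whose every attempt fails and costs costNanos of fake time.
     * Returns the number of sends made before the policy says stop, or cap if it never does.
     */
    static int attemptsBeforeGivingUp(Deadline policy, long[] now, long costNanos, int cap)
    {
        int attempts = 0;
        while (attempts < cap && !policy.reachedMax())
        {
            attempts++;
            now[0] += costNanos; // advance the fake clock by one failed attempt
        }
        return attempts;
    }
}
```

With a fake clock, 30ns per attempt, and a 100ns deadline, the deadline-aware policy stops after 4 sends, while the retryIndefinitely variant keeps sending until the artificial cap is hit.
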
> TCM's Retry.Deadline#retryIndefinitely is dangerous if used with 
> RemoteProcessor as the deadline does not impact message retries
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20059
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20059
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Transactional Cluster Metadata
>            Reporter: David Capwell
>            Priority: Normal
>             Fix For: 5.x
>
>
> {code}
> public static Deadline retryIndefinitely(long timeoutNanos, Meter retryMeter)
> {
>     return new Deadline(Clock.Global.nanoTime() + timeoutNanos,
>                         new Retry.Jitter(Integer.MAX_VALUE, DEFAULT_BACKOFF_MS, new Random(), retryMeter))
>     {
>         @Override
>         public boolean reachedMax()
>         {
>             return false;
>         }
>         @Override
>         public long remainingNanos()
>         {
>             return timeoutNanos;
>         }
>         public String toString()
>         {
>             return String.format("RetryIndefinitely{tries=%d}", currentTries());
>         }
>     };
> }
> {code}
> Sample usage pattern (example is in Accord, but same pattern exists in 
> RemoteProcessor.commit)
> {code}
> Promise<LogState> request = new AsyncPromise<>();
> List<InetAddressAndPort> candidates = new ArrayList<>(log.metadata().fullCMSMembers());
> sendWithCallbackAsync(request,
>                       Verb.TCM_RECONSTRUCT_EPOCH_REQ,
>                       new ReconstructLogState(lowEpoch, highEpoch, includeSnapshot),
>                       new CandidateIterator(candidates),
>                       retryPolicy);
> return request.get(retryPolicy.remainingNanos(), TimeUnit.NANOSECONDS);
> {code}
> The issue here is that the networking retry has no clue that we gave up 
> waiting on the request, so we will keep retrying until success!  The reason 
> for this is that “reachedMax” is used to see if it’s safe to run again, but 
> it isn’t, as the deadline has passed!



