[ 
https://issues.apache.org/jira/browse/CASSANDRA-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896417#comment-17896417
 ] 

David Capwell commented on CASSANDRA-20059:
-------------------------------------------

bq. I may be missing something and you didn't show where the accord example 
gets its retryPolicy from, but if the calling code is going to give up waiting 
on the request why use retryIndefinitely instead of using an actual deadline 
from at or after in Retry.Deadline?

Sorry for the lack of details.  If the policy is used without ever consulting the deadline, then it is fine: it does exactly what you asked for and retries forever.  In the Accord case, however, this policy was passed in as the retry behavior when fetching epochs, which produced the pattern below

{code}
Retry.Deadline deadline = Retry.Deadline.retryIndefinitely(DatabaseDescriptor.getCmsAwaitTimeout().to(TimeUnit.NANOSECONDS),
                                                           TCMMetrics.instance.fetchLogRetries);
return ClusterMetadataService.instance().processor()
                             .reconstruct(start, end, deadline);
{code}

The issue here is that reconstruct calls "getLogState", and in RemoteProcessor that method is

{code}
@Override
public LogState getLogState(Epoch lowEpoch, Epoch highEpoch, boolean includeSnapshot, Retry.Deadline retryPolicy)
{
    try
    {
        Promise<LogState> request = new AsyncPromise<>();
        List<InetAddressAndPort> candidates = new ArrayList<>(log.metadata().fullCMSMembers());
        sendWithCallbackAsync(request,
                              Verb.TCM_RECONSTRUCT_EPOCH_REQ,
                              new ReconstructLogState(lowEpoch, highEpoch, includeSnapshot),
                              new CandidateIterator(candidates),
                              retryPolicy);
        return request.get(retryPolicy.remainingNanos(), TimeUnit.NANOSECONDS);
    }
    catch (InterruptedException e)
    {
        throw new RuntimeException("Can not reconstruct during shutdown", e);
    }
    catch (ExecutionException | TimeoutException e)
    {
        throw new RuntimeException(String.format("Could not reconstruct range %d, %d", lowEpoch.getEpoch(), highEpoch.getEpoch()), e);
    }
}
{code}

In this case the code path (which is Accord-only) would give up waiting on the promise, but the message callback would keep retrying non-stop.

This pattern was found in several places; even if this retry policy were used only once in trunk, Accord has gone down a dangerous path.

Here is an example from org.apache.cassandra.tcm.FetchCMSLog.Handler#doVerb

{code}
Retry.Deadline retry = Retry.Deadline.retryIndefinitely(DatabaseDescriptor.getCmsAwaitTimeout().to(TimeUnit.NANOSECONDS),
                                                        TCMMetrics.instance.fetchLogRetries);
LogState delta;
if (consistentFetch)
    delta = processor.get().getLogState(message.payload.lowerBound, Epoch.MAX, false, retry);
{code}

And org.apache.cassandra.tcm.ReconstructLogState.Handler#doVerb

{code}
LogState result = processor.get().getLogState(request.lowerBound, request.higherBound, request.includeSnapshot,
                                              Retry.Deadline.retryIndefinitely(DatabaseDescriptor.getCmsAwaitTimeout().to(TimeUnit.NANOSECONDS),
                                                                               TCMMetrics.instance.fetchLogRetries));
{code}

> TCM's Retry.Deadline#retryIndefinitely is dangerous if used with 
> RemoteProcessor as the deadline does not impact message retries
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20059
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20059
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Transactional Cluster Metadata
>            Reporter: David Capwell
>            Priority: Normal
>             Fix For: 5.x
>
>
> {code}
> public static Deadline retryIndefinitely(long timeoutNanos, Meter retryMeter)
> {
>     return new Deadline(Clock.Global.nanoTime() + timeoutNanos,
>                         new Retry.Jitter(Integer.MAX_VALUE, DEFAULT_BACKOFF_MS, new Random(), retryMeter))
>     {
>         @Override
>         public boolean reachedMax()
>         {
>             return false;
>         }
>         @Override
>         public long remainingNanos()
>         {
>             return timeoutNanos;
>         }
>         public String toString()
>         {
>             return String.format("RetryIndefinitely{tries=%d}", currentTries());
>         }
>     };
> }
> {code}
> Sample usage pattern (example is in Accord, but same pattern exists in 
> RemoteProcessor.commit)
> {code}
> Promise<LogState> request = new AsyncPromise<>();
> List<InetAddressAndPort> candidates = new ArrayList<>(log.metadata().fullCMSMembers());
> sendWithCallbackAsync(request,
>                       Verb.TCM_RECONSTRUCT_EPOCH_REQ,
>                       new ReconstructLogState(lowEpoch, highEpoch, includeSnapshot),
>                       new CandidateIterator(candidates),
>                       retryPolicy);
> return request.get(retryPolicy.remainingNanos(), TimeUnit.NANOSECONDS);
> {code}
> The issue here is that the networking retry has no clue that we gave up 
> waiting on the request, so we will keep retrying until success!  The reason 
> is that "reachedMax" is used to check whether it is safe to run again, but it 
> isn't safe, as the deadline has already passed!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
