[
https://issues.apache.org/jira/browse/CASSANDRA-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896417#comment-17896417
]
David Capwell commented on CASSANDRA-20059:
-------------------------------------------
bq. I may be missing something and you didn't show where the accord example
gets its retryPolicy from, but if the calling code is going to give up waiting
on the request why use retryIndefinitely instead of using an actual deadline
from at or after in Retry.Deadline?
Sorry for the lack of details. If the caller never looks at the deadline, this
is fine: it does what you asked and retries forever. In the Accord case this
policy was handed down as the way to fetch epochs, so we had the pattern below
{code}
Retry.Deadline deadline =
    Retry.Deadline.retryIndefinitely(DatabaseDescriptor.getCmsAwaitTimeout().to(TimeUnit.NANOSECONDS),
                                     TCMMetrics.instance.fetchLogRetries);
return ClusterMetadataService.instance().processor()
                             .reconstruct(start, end, deadline);
{code}
The issue here was that reconstruct calls "getLogState", and in RemoteProcessor
that method is
{code}
@Override
public LogState getLogState(Epoch lowEpoch, Epoch highEpoch, boolean includeSnapshot, Retry.Deadline retryPolicy)
{
    try
    {
        Promise<LogState> request = new AsyncPromise<>();
        List<InetAddressAndPort> candidates = new ArrayList<>(log.metadata().fullCMSMembers());
        sendWithCallbackAsync(request,
                              Verb.TCM_RECONSTRUCT_EPOCH_REQ,
                              new ReconstructLogState(lowEpoch, highEpoch, includeSnapshot),
                              new CandidateIterator(candidates),
                              retryPolicy);
        return request.get(retryPolicy.remainingNanos(), TimeUnit.NANOSECONDS);
    }
    catch (InterruptedException e)
    {
        throw new RuntimeException("Can not reconstruct during shutdown", e);
    }
    catch (ExecutionException | TimeoutException e)
    {
        throw new RuntimeException(String.format("Could not reconstruct range %d, %d", lowEpoch.getEpoch(), highEpoch.getEpoch()), e);
    }
}
{code}
In this case this code path (which is Accord-only) would give up waiting, but
the messaging layer would keep retrying non-stop.
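To make the hazard concrete, here is a minimal standalone sketch (IndefiniteDeadline is a simplified stand-in I made up for illustration, not the real TCM class): a send loop that consults only reachedMax() keeps retrying even after the caller's remainingNanos window has expired.

```java
import java.util.concurrent.TimeUnit;

// Simplified stand-in for Retry.Deadline.retryIndefinitely (hypothetical, not
// the actual TCM class): the wall-clock deadline is recorded, but reachedMax()
// never reports it as exceeded.
class IndefiniteDeadline
{
    final long deadlineNanos;

    IndefiniteDeadline(long timeoutNanos)
    {
        this.deadlineNanos = System.nanoTime() + timeoutNanos;
    }

    // The messaging retry loop asks this before each resend; it never says stop.
    boolean reachedMax() { return false; }

    // The caller only waits this long on its promise.
    long remainingNanos() { return deadlineNanos - System.nanoTime(); }
}

public class RetryHazardDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        IndefiniteDeadline retry = new IndefiniteDeadline(TimeUnit.MILLISECONDS.toNanos(50));
        Thread.sleep(100); // simulate the caller's request.get(...) timing out

        // The caller has given up...
        System.out.println("caller's wait expired: " + (retry.remainingNanos() <= 0));
        // ...but the send loop sees no reason to stop retrying.
        System.out.println("send loop would stop: " + retry.reachedMax());
    }
}
```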
This pattern was found in several places; it is not just a one-off in trunk, so
it is not only Accord that went down this dangerous path.
Here is an example from org.apache.cassandra.tcm.FetchCMSLog.Handler#doVerb
{code}
Retry.Deadline retry =
    Retry.Deadline.retryIndefinitely(DatabaseDescriptor.getCmsAwaitTimeout().to(TimeUnit.NANOSECONDS),
                                     TCMMetrics.instance.fetchLogRetries);
LogState delta;
if (consistentFetch)
    delta = processor.get().getLogState(message.payload.lowerBound, Epoch.MAX, false, retry);
{code}
And org.apache.cassandra.tcm.ReconstructLogState.Handler#doVerb
{code}
LogState result = processor.get().getLogState(request.lowerBound,
                                              request.higherBound,
                                              request.includeSnapshot,
                                              Retry.Deadline.retryIndefinitely(DatabaseDescriptor.getCmsAwaitTimeout().to(TimeUnit.NANOSECONDS),
                                                                               TCMMetrics.instance.fetchLogRetries));
{code}
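For contrast, a deadline-aware policy (a hypothetical sketch of the idea, not the actual fix in TCM) would fold the wall-clock check into reachedMax(), so the messaging retry loop gives up no later than the waiting caller:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical deadline-aware retry policy: reachedMax() reports exhaustion
// once the wall-clock deadline passes, so a resend loop that consults it
// cannot outlive the caller's wait.
class BoundedDeadline
{
    final long deadlineNanos;
    final int maxTries;
    int tries = 0;

    BoundedDeadline(long timeoutNanos, int maxTries)
    {
        this.deadlineNanos = System.nanoTime() + timeoutNanos;
        this.maxTries = maxTries;
    }

    boolean reachedMax()
    {
        // Stop either when the retry budget is spent or the deadline has passed.
        return tries >= maxTries || System.nanoTime() >= deadlineNanos;
    }

    long remainingNanos()
    {
        return Math.max(0, deadlineNanos - System.nanoTime());
    }
}
```

Handing a policy like this to the resend loop would make it observe the same deadline the blocked caller does, instead of retrying until success.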
> TCM's Retry.Deadline#retryIndefinitely is dangerous if used with
> RemoteProcessor as the deadline does not impact message retries
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-20059
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20059
> Project: Cassandra
> Issue Type: Bug
> Components: Transactional Cluster Metadata
> Reporter: David Capwell
> Priority: Normal
> Fix For: 5.x
>
>
> {code}
> public static Deadline retryIndefinitely(long timeoutNanos, Meter retryMeter)
> {
>     return new Deadline(Clock.Global.nanoTime() + timeoutNanos,
>                         new Retry.Jitter(Integer.MAX_VALUE, DEFAULT_BACKOFF_MS, new Random(), retryMeter))
>     {
>         @Override
>         public boolean reachedMax()
>         {
>             return false;
>         }
>
>         @Override
>         public long remainingNanos()
>         {
>             return timeoutNanos;
>         }
>
>         public String toString()
>         {
>             return String.format("RetryIndefinitely{tries=%d}", currentTries());
>         }
>     };
> }
> {code}
> Sample usage pattern (example is in Accord, but same pattern exists in
> RemoteProcessor.commit)
> {code}
> Promise<LogState> request = new AsyncPromise<>();
> List<InetAddressAndPort> candidates = new ArrayList<>(log.metadata().fullCMSMembers());
> sendWithCallbackAsync(request,
>                       Verb.TCM_RECONSTRUCT_EPOCH_REQ,
>                       new ReconstructLogState(lowEpoch, highEpoch, includeSnapshot),
>                       new CandidateIterator(candidates),
>                       retryPolicy);
> return request.get(retryPolicy.remainingNanos(), TimeUnit.NANOSECONDS);
> {code}
> The issue here is that the networking retry has no clue that we gave up
> waiting on the request, so it will keep retrying until success! The reason
> is that "reachedMax" is used to check whether it is safe to run again, but
> it isn't safe, as the deadline has already passed!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)