beobal commented on code in PR #3502: URL: https://github.com/apache/cassandra/pull/3502#discussion_r1743697811
########## src/java/org/apache/cassandra/tcm/log/LogReader.java: ########## @@ -60,6 +60,11 @@ public interface LogReader * case we return all subsequent entries in the log. */ default LogState getLogState(Epoch startEpoch) + { + return getLogState(startEpoch, true); + } + + default LogState getLogState(Epoch startEpoch, boolean allowSnapshots) Review Comment: note: this is also not strictly related to the issue at hand, but was spotted during testing. When constructing a `LogState` for inclusion in a success or failure response to a `Commit` we never want to include metadata snapshots, only entries (see `LocalLog::getCommittedEntries`) ########## src/java/org/apache/cassandra/tcm/log/LogReader.java: ########## @@ -69,11 +74,13 @@ default LogState getLogState(Epoch startEpoch) // If there is at most 1 snapshot with an epoch > startEpoch, we prefer to skip that snapshot and just build a // list of consecutive entries - if (snapshotEpochs.size() <= 1) + if (snapshotEpochs.size() <= 1 || !allowSnapshots) { entries = getEntries(startEpoch); if (entries.isContinuous()) return new LogState(null, entries.immutable()); + else if (!allowSnapshots) + throw new IllegalStateException("Can't construct a continious log since " + startEpoch + " and we don't allow snapshots"); Review Comment: typo: `continious` ########## src/java/org/apache/cassandra/tcm/ClusterMetadataService.java: ########## @@ -844,11 +844,23 @@ private Pair<State, Processor> delegateInternal() @Override public Commit.Result commit(Entry.Id entryId, Transformation transform, Epoch lastKnown, Retry.Deadline retryPolicy) { - Pair<State, Processor> delegate = delegateInternal(); - Commit.Result result = delegate.right.commit(entryId, transform, lastKnown, retryPolicy); - if (delegate.left == LOCAL || delegate.left == RESET) - replicator.send(result, null); - return result; + while (!retryPolicy.reachedMax()) Review Comment: The underlying processor or the choice of which processor to delegate to is the issue here so I don't see how that can be pushed down into the delegated-to processor. Using the `retryPolicy` to manage retries seems like the right thing to do as it will maintain the state of the retries if a `NotCMSException` is encountered. e.g.s * node is a CMS member, so delegate to a local processor * local processor makes several attempts to commit, but is unable to due to competing commits * CMS membership changes as a result of one of those competing commits, node is no longer a member * on the next retry, local processor throws `NotCMSException` * which is caught in `SwitchableProcessor`, which itself now retries and obtains a new, remote delegate * committing via that remote delegate shouldn't reset the retry policy (i.e. if it had a `maxTries` the total attempts across both delegated processors should count toward it). ########## src/java/org/apache/cassandra/tcm/log/LocalLog.java: ########## @@ -894,9 +894,9 @@ private LogListener snapshotListener() if ((entry.epoch.getEpoch() % DatabaseDescriptor.getMetadataSnapshotFrequency()) == 0) { - List<InetAddressAndPort> list = new ArrayList<>(ClusterMetadata.current().fullCMSMembers()); - list.sort(comparing(i -> i.addressBytes[i.addressBytes.length - 1])); - if (list.get(0).equals(FBUtilities.getBroadcastAddressAndPort())) + List<NodeId> list = new ArrayList<>(metadata.success().metadata.fullCMSMemberIds()); Review Comment: note: this is unrelated to the issue being addressed here, but removes the chance of non-determinism where nodes share the same the last address byte ########## src/java/org/apache/cassandra/tcm/ClusterMetadataService.java: ########## @@ -844,11 +844,23 @@ private Pair<State, Processor> delegateInternal() @Override public Commit.Result commit(Entry.Id entryId, Transformation transform, Epoch lastKnown, Retry.Deadline retryPolicy) { - Pair<State, Processor> delegate = delegateInternal(); - Commit.Result result = delegate.right.commit(entryId, transform, lastKnown, retryPolicy); - if (delegate.left == LOCAL || delegate.left == RESET) - replicator.send(result, null); - return result; + while (!retryPolicy.reachedMax()) + { + try + { + Pair<State, Processor> delegate = delegateInternal(); + Commit.Result result = delegate.right.commit(entryId, transform, lastKnown, retryPolicy); + ClusterMetadataService.State state = delegate.left; + if (state == LOCAL || state == RESET) + replicator.send(result, null); + return result; + } + catch (NotCMSException e) + { + retryPolicy.maybeSleep(); + } + } + return Commit.Result.failed(ExceptionCode.SERVER_ERROR, "Could not commit " + transform.kind() + " at epoch " + lastKnown); Review Comment: It seems we're rather inconsistent in this class and have a mix of both concatenation and format when throwing exceptions. The only real benefit of format is readability, which I don't think is terrible anyway so +0 ¯\_(ツ)_/¯ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: pr-unsubscr...@cassandra.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: pr-unsubscr...@cassandra.apache.org For additional commands, e-mail: pr-h...@cassandra.apache.org