ifesdjeen commented on code in PR #3196:
URL: https://github.com/apache/cassandra/pull/3196#discussion_r1559437385
##########
test/distributed/org/apache/cassandra/distributed/test/TransientRangeMovement2Test.java:
##########
@@ -51,9 +59,30 @@ public void testMoveBackward() throws IOException,
ExecutionException, Interrupt
.start()))
{
populate(cluster);
- cluster.get(3).nodetoolResult("move", "25").asserts().success();
+ // At the start, node1 is a TRANSIENT replica for (20,30] and FULL
for (30, 40], (40,] & (,10]. When moving
+ // node3 to token 25, node1 becomes a FULL replica for (25, 40],
effectively going from TRANSIENT to FULL
+ // for (25,30]. A T->F transition will always cause data for that
range to be streamed to the transitioning
+ // node, which happens after StartMove and before MidMove. Running
cleanup before node1 considers itself a
+ // FULL replica would remove any of the newly streamed data which
is marked repaired. To avoid this, we
+ // ensure that any T->F transition is applied for writes as part
of the StartMove. Have the CMS node (node1)
+ // pause before the MidMove step is committed, at which point we
know that streaming has completed.
+ Callable<Epoch> pending = pauseBeforeCommit(cluster.get(1), (e) ->
e instanceof PrepareMove.MidMove);
+ new Thread(() -> cluster.get(3).nodetoolResult("move",
"25").asserts().success()).start();
Review Comment:
Would you mind to add a `join` after we have finished with this thread below?
##########
src/java/org/apache/cassandra/tcm/sequences/RemoveNodeStreams.java:
##########
@@ -131,33 +131,12 @@ private static MovementMap movementMap(InetAddressAndPort
leaving, ClusterMetada
if (!replica.endpoint().equals(leaving)
&& !replica.endpoint().equals(newReplica.endpoint()))
candidateBuilder.add(replica,
ReplicaCollection.Builder.Conflict.NONE);
});
- movements.putAll(newReplica,
candidateBuilder.build(), ReplicaCollection.Builder.Conflict.NONE);
+ EndpointsForRange sources =
candidateBuilder.build();
+ // log if newReplica is an existing
transient replica moving to a full replica
Review Comment:
We can make a similar change in `UnbootstrapStreams` calculation of the
movement map, change the mid check to
```
startWriteAdditions.get(removal.endpoint()).contains(removal.range(), true);
```
##########
src/java/org/apache/cassandra/tcm/sequences/RemoveNodeStreams.java:
##########
@@ -131,33 +131,12 @@ private static MovementMap movementMap(InetAddressAndPort
leaving, ClusterMetada
if (!replica.endpoint().equals(leaving)
&& !replica.endpoint().equals(newReplica.endpoint()))
candidateBuilder.add(replica,
ReplicaCollection.Builder.Conflict.NONE);
});
- movements.putAll(newReplica,
candidateBuilder.build(), ReplicaCollection.Builder.Conflict.NONE);
+ EndpointsForRange sources =
candidateBuilder.build();
+ // log if newReplica is an existing
transient replica moving to a full replica
+ if
(startWriteRemovals.get(newReplica.endpoint()).contains(newReplica.range(),
false))
+ logger.debug("Streaming transient ->
full conversion to {} from {}", newReplica.endpoint(), sources);
+ movements.putAll(newReplica, sources,
ReplicaCollection.Builder.Conflict.NONE);
});
- // and check if any replicas went from transient -> full:
Review Comment:
it also happens that we do not need mid and finish delta for this method
anymore. Should we remove them?
##########
src/java/org/apache/cassandra/tcm/transformations/PrepareMove.java:
##########
@@ -103,7 +104,9 @@ public Result execute(ClusterMetadata prev)
StartMove startMove = new StartMove(nodeId,
transitionPlan.addToWrites(), lockKey);
MidMove midMove = new MidMove(nodeId, transitionPlan.moveReads(),
lockKey);
FinishMove finishMove = new FinishMove(nodeId, tokens,
transitionPlan.removeFromWrites(), lockKey);
-
+ Result res =
PlacementTransitionPlan.assertPreExistingWriteReplica(prev.placements,
transitionPlan);
Review Comment:
Do you folks think maybe having an Option here could be useful? Or some form
of polymorphic result that would visually explain the significance of `null`
equivalent?
##########
test/distributed/org/apache/cassandra/distributed/test/log/PlacementSimulator.java:
##########
@@ -985,20 +977,84 @@ public Diff<T> invert()
}
}
- public static String diffsToString(Map<Range, Diff<Node>> placements)
+ public static Diff<Replica> additionsAndTransientToFull(Diff<Replica>
unfiltered)
+ {
+ if (unfiltered.additions.isEmpty())
+ return null;
+ List<Replica> additions = new ArrayList<>(unfiltered.additions.size());
+ List<Replica> removals = new ArrayList<>(unfiltered.additions.size());
+ for (Replica added : unfiltered.additions)
+ {
+ // Include any new FULL replicas here. If there exists the removal
of a corresponding Transient
+ // replica (i.e. a switch from T -> F), add that too. We want T ->
F transitions to happen early
+ // in a multi-step operation, at the same time as new write
replicas are added.
+ if (added.isFull())
+ {
+ additions.add(added);
+ for (Replica removed : unfiltered.removals)
+ if (removed.node().equals(added.node()) &&
removed.isTransient())
+ removals.add(removed);
+ }
+ else
+ {
+ // Conversely, when a replica transitions from F -> T, it's
enacted late in a multi-step operation.
+ // So only include TRANSIENT additions if there is no removal
of a corresponding FULL replica.
+ boolean include = unfiltered.removals.stream()
+ .noneMatch(removed ->
removed.node().equals(added.node())
+ &&
removed.isFull());
+ if (include)
+ additions.add(added);
+ }
+ }
+ return new Diff<>(additions, removals);
+ }
+
+ public static Diff<Replica> removalsAndFullToTransient(Diff<Replica>
unfiltered)
+ {
+ if (unfiltered.removals.isEmpty())
+ return null;
+ List<Replica> additions = new ArrayList<>(unfiltered.removals.size());
+ List<Replica> removals = new ArrayList<>(unfiltered.removals.size());
+ for (Replica removed : unfiltered.removals)
+ {
+ // Include any new FULL replicas here. If there exists the removal
of a corresponding Transient
+ // replica (i.e. a switch from T -> F), add that too. We want T ->
F transitions to happen early
+ // in a multi-step operation, at the same time as new write
replicas are added.
+ if (removed.isFull())
+ {
+ removals.add(removed);
+ for (Replica added : unfiltered.additions)
Review Comment:
Nit: I think same comment applies then to the above
`additionsAndTransientToFull` code.
##########
test/harry/main/org/apache/cassandra/harry/sut/TokenPlacementModel.java:
##########
@@ -62,18 +75,34 @@ public ReplicatedRanges replicate(List<Node> nodes)
public abstract ReplicatedRanges replicate(Range[] ranges, List<Node>
nodes);
}
+ public static class DCReplicas
+ {
+ public final int totalCount;
Review Comment:
Nit: should we try keeping API here consistent with RF?
```
public final int allReplicas;
public final int fullReplicas;
```
##########
test/distributed/org/apache/cassandra/distributed/test/log/PlacementSimulator.java:
##########
@@ -985,20 +977,84 @@ public Diff<T> invert()
}
}
- public static String diffsToString(Map<Range, Diff<Node>> placements)
+ public static Diff<Replica> additionsAndTransientToFull(Diff<Replica>
unfiltered)
+ {
+ if (unfiltered.additions.isEmpty())
+ return null;
+ List<Replica> additions = new ArrayList<>(unfiltered.additions.size());
+ List<Replica> removals = new ArrayList<>(unfiltered.additions.size());
+ for (Replica added : unfiltered.additions)
+ {
+ // Include any new FULL replicas here. If there exists the removal
of a corresponding Transient
+ // replica (i.e. a switch from T -> F), add that too. We want T ->
F transitions to happen early
+ // in a multi-step operation, at the same time as new write
replicas are added.
+ if (added.isFull())
+ {
+ additions.add(added);
+ for (Replica removed : unfiltered.removals)
Review Comment:
Nit: should use same style for filtering / checking removals in both
branches?
I _think_ we can also break out from the loop here early if we did add a new
entry to `removals`.
##########
test/distributed/org/apache/cassandra/distributed/test/TransientRangeMovement2Test.java:
##########
@@ -51,9 +59,30 @@ public void testMoveBackward() throws IOException,
ExecutionException, Interrupt
.start()))
{
populate(cluster);
- cluster.get(3).nodetoolResult("move", "25").asserts().success();
+ // At the start, node1 is a TRANSIENT replica for (20,30] and FULL
for (30, 40], (40,] & (,10]. When moving
+ // node3 to token 25, node1 becomes a FULL replica for (25, 40],
effectively going from TRANSIENT to FULL
+ // for (25,30]. A T->F transition will always cause data for that
range to be streamed to the transitioning
+ // node, which happens after StartMove and before MidMove. Running
cleanup before node1 considers itself a
+ // FULL replica would remove any of the newly streamed data which
is marked repaired. To avoid this, we
+ // ensure that any T->F transition is applied for writes as part
of the StartMove. Have the CMS node (node1)
+ // pause before the MidMove step is committed, at which point we
know that streaming has completed.
+ Callable<Epoch> pending = pauseBeforeCommit(cluster.get(1), (e) ->
e instanceof PrepareMove.MidMove);
+ new Thread(() -> cluster.get(3).nodetoolResult("move",
"25").asserts().success()).start();
+
+ // Before allowing node1 to proceed with committing the MidMove,
instruct it to pause node1 before it
Review Comment:
nit: This line reads a bit tautological, would maybe something like "To
gate/prevent node1 from proceeding with committing MidMove, instruct it to
pause before enacting it" ?
##########
src/java/org/apache/cassandra/tcm/transformations/PrepareJoin.java:
##########
@@ -156,6 +153,12 @@ public Result execute(ClusterMetadata prev)
transitionPlan.toSplit,
startJoin,
midJoin, finishJoin,
joinTokenRing,
streamData);
+ if (!(this instanceof UnsafeJoin) && !prev.tokenMap.isEmpty())
Review Comment:
Would you be OK to move this logic to an instance method, and have
implemenation in PrepareJoin, and override it in UnsafeJoin? It would be great
if we could keep it polymorphic.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]