sashapolo commented on code in PR #7248:
URL: https://github.com/apache/ignite-3/pull/7248#discussion_r2675350366
##########
modules/network-api/src/main/java/org/apache/ignite/internal/network/JoinedNodes.java:
##########
@@ -26,13 +26,15 @@ public interface JoinedNodes {
* Called when the node joins logical topology.
*
* @param node Node.
+ * @param topologyVersion Logical topology version.
*/
- void onJoined(InternalClusterNode node);
+ void onJoined(InternalClusterNode node, long topologyVersion);
Review Comment:
Not related to this PR, but this class' name is awful, would be nice if we
could rename it
##########
modules/network-api/src/main/java/org/apache/ignite/internal/network/AbstractTopologyService.java:
##########
@@ -28,6 +28,8 @@ public abstract class AbstractTopologyService implements
TopologyService {
/** Registered event handlers. */
private final Collection<TopologyEventHandler> eventHandlers = new
CopyOnWriteArrayList<>();
+ protected long topologyVersion;
Review Comment:
How do you guarantee visibility of this field? All other fields in this and
related classes allow concurrent access
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/message/StaleNodeHandlingParams.java:
##########
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.network.recovery.message;
+
+/**
+ * Parameters required for handling stale state of the node.
+ */
+public interface StaleNodeHandlingParams {
Review Comment:
I personally prefer `Paratemers` instead of `Params`, but this is obviously
a matter of taste
##########
modules/network-api/src/main/java/org/apache/ignite/internal/network/AbstractTopologyService.java:
##########
@@ -28,6 +28,8 @@ public abstract class AbstractTopologyService implements
TopologyService {
/** Registered event handlers. */
private final Collection<TopologyEventHandler> eventHandlers = new
CopyOnWriteArrayList<>();
+ protected long topologyVersion;
Review Comment:
Also, I personally don't like having mutable protected fields. I would
suggest to move this field into `ScaleCubeTopologyService`
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryInitiatorHandshakeManager.java:
##########
@@ -380,6 +379,8 @@ private static boolean clusterIdMismatch(@Nullable UUID
acceptorClusterId, @Null
}
private void handleStaleAcceptorId(HandshakeStartMessage msg) {
+ maybeFailOnStaleNodeDetection(failureProcessor, new
StaleNodeHandlingParamsImpl(topologyService), msg);
Review Comment:
Same here
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryInitiatorHandshakeManager.java:
##########
@@ -508,6 +505,17 @@ private void handshake(RecoveryDescriptor descriptor) {
});
}
+ protected HandshakeStartResponseMessage
createHandshakeStartResponseMessage(RecoveryDescriptor descriptor) {
+ StaleNodeHandlingParamsImpl params = new
StaleNodeHandlingParamsImpl(topologyService);
Review Comment:
Same question about this object
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryInitiatorHandshakeManager.java:
##########
@@ -535,4 +543,5 @@ protected void finishHandshake() {
void setRemoteNode(InternalClusterNode remoteNode) {
this.remoteNode = remoteNode;
}
+
Review Comment:
```suggestion
```
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/message/StaleNodeHandlingParams.java:
##########
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.network.recovery.message;
+
+/**
+ * Parameters required for handling stale state of the node.
Review Comment:
```suggestion
* Parameters required for handling stale state of a node.
```
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryAcceptorHandshakeManager.java:
##########
@@ -257,6 +283,8 @@ private boolean
possiblyRejectHandshakeStartResponse(HandshakeStartResponseMessa
}
private void handleStaleInitiatorId(HandshakeStartResponseMessage msg) {
+ maybeFailOnStaleNodeDetection(failureProcessor, new
StaleNodeHandlingParamsImpl(topologyService), msg);
Review Comment:
Should we send the response and only then fail the node?
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/HandshakeManagerUtils.java:
##########
@@ -88,4 +92,19 @@ static Exception
createExceptionFromRejectionMessage(HandshakeRejectedMessage ms
? new RecipientLeftException(msg.message())
: new HandshakeException(msg.message());
}
+
+ static void maybeFailOnStaleNodeDetection(
+ FailureProcessor failureProcessor,
+ StaleNodeHandlingParams local,
+ StaleNodeHandlingParams remote
+ ) {
+ long localTopologyVersion = local.topologyVersion();
+ long remoteTopologyVersion = remote.topologyVersion();
+
+ if (localTopologyVersion >= remoteTopologyVersion) {
+ return;
+ }
+
+ failureProcessor.process(new
FailureContext(FailureType.CRITICAL_ERROR, null, "Node is segmented."));
Review Comment:
I think we need to print more information, rather than `"Node is
segmented."`. Let's write something `"Node is unable to join the cluster
because its handshake has been rejected, possibly because this node suffered
from a network segmentation, local topology version = blabla, remote topology
version = blabla"`
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/HandshakeManagerUtils.java:
##########
@@ -88,4 +92,19 @@ static Exception
createExceptionFromRejectionMessage(HandshakeRejectedMessage ms
? new RecipientLeftException(msg.message())
: new HandshakeException(msg.message());
}
+
+ static void maybeFailOnStaleNodeDetection(
+ FailureProcessor failureProcessor,
+ StaleNodeHandlingParams local,
+ StaleNodeHandlingParams remote
+ ) {
+ long localTopologyVersion = local.topologyVersion();
+ long remoteTopologyVersion = remote.topologyVersion();
+
+ if (localTopologyVersion >= remoteTopologyVersion) {
+ return;
+ }
+
+ failureProcessor.process(new
FailureContext(FailureType.CRITICAL_ERROR, null, "Node is segmented."));
Review Comment:
Not related to this PR, but rather a general question. Do we still use a
no-op failure handler? What does the current behavior look like when a
partition happens, does the offending node just get stuck unable to connect to
anybody?
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryAcceptorHandshakeManager.java:
##########
@@ -192,6 +206,18 @@ public void onConnectionOpen() {
});
}
+ protected HandshakeStartMessage createHandshakeStartMessage() {
+ StaleNodeHandlingParamsImpl params = new
StaleNodeHandlingParamsImpl(topologyService);
Review Comment:
Why do you need this object?
##########
modules/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryAcceptorHandshakeManager.java:
##########
@@ -192,6 +206,18 @@ public void onConnectionOpen() {
});
}
+ protected HandshakeStartMessage createHandshakeStartMessage() {
Review Comment:
```suggestion
private HandshakeStartMessage createHandshakeStartMessage() {
```
##########
modules/network/src/integrationTest/java/org/apache/ignite/internal/network/node/ItNodeStalenessAndRestartTest.java:
##########
@@ -59,13 +72,36 @@ void nodeStalenessStatusIsClearedOnRestart() throws
Exception {
);
}
+ @Test
+ @ConfigOverride(name = "ignite.failureHandler.handler.type", value =
"stop")
+ @MuteFailureManagerLogging
+ void staleNodeIsShutDown() throws Exception {
+ IgniteImpl ignite0 = unwrapIgniteImpl(cluster.node(0));
+
+ LogInspector logInspector = new LogInspector(
+ FailureManager.class.getName(),
+ evt -> evt.getLevel() == Level.ERROR
+ &&
evt.getMessage().getFormattedMessage().contains(FAILURE_MESSAGE)
+ &&
Thread.currentThread().getName().contains(cluster.nodeName(1))
+ );
+
+ logInspector.start();
+ try {
+ simulateNetworkPartition(ignite0);
+
+ await().timeout(10, SECONDS).until(logInspector::isMatched);
Review Comment:
Is it possible to enable the node failing manager and just check that the
target node has been stopped?
##########
modules/network/src/integrationTest/java/org/apache/ignite/internal/network/node/ItNodeStalenessAndRestartTest.java:
##########
@@ -21,17 +21,29 @@
import static
org.apache.ignite.internal.ConfigTemplates.FAST_FAILURE_DETECTION_NODE_BOOTSTRAP_CFG_TEMPLATE;
import static org.apache.ignite.internal.TestWrappers.unwrapIgniteImpl;
import static
org.apache.ignite.internal.testframework.IgniteTestUtils.waitForCondition;
+import static org.awaitility.Awaitility.await;
import static org.junit.jupiter.api.Assertions.assertTrue;
import java.util.concurrent.CountDownLatch;
import org.apache.ignite.internal.ClusterPerTestIntegrationTest;
+import org.apache.ignite.internal.ConfigOverride;
import org.apache.ignite.internal.app.IgniteImpl;
+import org.apache.ignite.internal.failure.FailureManager;
import org.apache.ignite.internal.network.InternalClusterNode;
import org.apache.ignite.internal.network.TopologyEventHandler;
import org.apache.ignite.internal.network.message.ScaleCubeMessage;
+import
org.apache.ignite.internal.testframework.failure.FailureManagerExtension;
+import
org.apache.ignite.internal.testframework.failure.MuteFailureManagerLogging;
+import org.apache.ignite.internal.testframework.log4j2.LogInspector;
+import org.apache.logging.log4j.Level;
import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.ExtendWith;
+@ExtendWith(FailureManagerExtension.class)
class ItNodeStalenessAndRestartTest extends ClusterPerTestIntegrationTest {
+
+ private static final String FAILURE_MESSAGE = "Ignite node is in invalid
state due to a critical failure.";
Review Comment:
It would be nice to check for a particular exception message, not just a
general one from the failure manager
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]