[ https://issues.apache.org/jira/browse/IGNITE-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy updated IGNITE-25805:
---------------------------------------
    Description:
When sending a message, it might be necessary to establish a connection first. After establishing the connection, we check that the recipient ClusterNode's ID matches the ID of the node to which we actually established the connection. If they differ, we throw a RecipientLeftException.

This makes sense when the caller of MessagingService identifies the recipient by a ClusterNode. The check then ensures that the expected 'incarnation' of the node is the one that receives the message.

But if the user of MessagingService doesn't care about the specific incarnation (they send by consistent ID or by network address), the check should not be made at all.

The consequences can be serious. For example, ScaleCube sends its internal messages by network address. The following log excerpt shows that the physical topology on the node cannot be updated: getMetadata() makes a request to the new (reappeared) node, but because that node's ID differs from the one stored in the topology, the request fails. This in turn fails the topology update, freezing the topology, possibly indefinitely:

{noformat}
2025-06-30 15:14:52:230 +0000 [WARNING][sc-cluster-3344-1][MetadataStore] [default:RollingUpgradeTest_cluster_0:736eff04a39940e3@172.30.0.2:3344][25e0caca-ff93-499e-b8e5-93a6b0f9690c] Timeout getting GetMetadataResp from 172.30.0.3:3345 within 3000 ms, cause: org.apache.ignite.internal.network.RecipientLeftException: IGN-NETWORK-5 Target node ID is 446f2ec2-38f3-4636-9cd9-68626b6154cc, but 1ed5b18f-bdce-4afc-8a00-35af4aaa8683 responded TraceId:844e01fa
2025-06-30 15:14:52:230 +0000 [WARNING][sc-cluster-3344-1][MembershipProtocol] [default:Test_cluster_0:736eff04a39940e3@172.30.0.2:3344][updateMembership][SYNC] Skipping to add/update member: {m: default:Test_cluster_1:1ed5b18fbdce4afc@172.30.0.3:3345, s: ALIVE, inc: 0}, due to failed fetchMetadata call (cause: org.apache.ignite.internal.network.RecipientLeftException: IGN-NETWORK-5 Target node ID is 446f2ec2-38f3-4636-9cd9-68626b6154cc, but 1ed5b18f-bdce-4afc-8a00-35af4aaa8683 responded TraceId:844e01fa)
{noformat}

Another issue arises when the SQL engine sends a request by consistent ID while the local topology has not been updated yet (such updates happen asynchronously). In that case we might reject the send even though the recipient node is ready to handle the message:

{noformat}
2025-06-30 15:14:53:588 +0000 [WARNING][RollingUpgradeTest_cluster_0-network-worker-1][ExecutionServiceImpl] Fragment closing processed with errors: [queryId=ffffffff-c195-43d5-0000-000000000003]
org.apache.ignite.internal.network.RecipientLeftException: IGN-NETWORK-5 Target node ID is 446f2ec2-38f3-4636-9cd9-68626b6154cc, but 1ed5b18f-bdce-4afc-8a00-35af4aaa8683 responded TraceId:9c391022
    at org.apache.ignite.internal.network.DefaultMessagingService.lambda$sendViaNetwork$3(DefaultMessagingService.java:360)
{noformat}

> Target node ID check is too strict
> ----------------------------------
>
>                 Key: IGNITE-25805
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25805
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
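
The behaviour requested in the description can be illustrated with a minimal, self-contained sketch. It is not the Ignite implementation: `RecipientCheckSketch`, `AddressedBy` and `checkRecipient` are hypothetical names introduced for illustration, and the `RecipientLeftException` class below is a local stand-in for the real one. The only assumption taken from the issue is the rule itself: compare node IDs only when the sender addressed a specific ClusterNode, and skip the comparison when it addressed the node by consistent ID or by network address.

```java
import java.util.UUID;

/** Hypothetical stand-in for the exception named in the issue; not the Ignite class. */
class RecipientLeftException extends RuntimeException {
    RecipientLeftException(String message) {
        super(message);
    }
}

/** Hypothetical sketch: apply the ID check only when a specific incarnation was targeted. */
final class RecipientCheckSketch {
    /** How the caller identified the recipient (categories taken from the issue text). */
    enum AddressedBy { CLUSTER_NODE, CONSISTENT_ID, NETWORK_ADDRESS }

    private RecipientCheckSketch() {
    }

    /**
     * Verifies the node behind a freshly established connection.
     *
     * @param addressedBy how the caller identified the recipient
     * @param expectedId ID of the incarnation the caller expects (used only for CLUSTER_NODE)
     * @param connectedId ID reported by the node the connection was actually established to
     */
    static void checkRecipient(AddressedBy addressedBy, UUID expectedId, UUID connectedId) {
        if (addressedBy != AddressedBy.CLUSTER_NODE) {
            // Sent by consistent ID or network address: any incarnation is acceptable.
            return;
        }
        if (!expectedId.equals(connectedId)) {
            throw new RecipientLeftException(
                    "Target node ID is " + expectedId + ", but " + connectedId + " responded");
        }
    }

    public static void main(String[] args) {
        // IDs copied from the log excerpts in the issue.
        UUID stale = UUID.fromString("446f2ec2-38f3-4636-9cd9-68626b6154cc");
        UUID current = UUID.fromString("1ed5b18f-bdce-4afc-8a00-35af4aaa8683");

        // Accepted: the sender addressed the node by network address.
        checkRecipient(AddressedBy.NETWORK_ADDRESS, stale, current);

        try {
            // Rejected: the sender asked for a specific incarnation that is gone.
            checkRecipient(AddressedBy.CLUSTER_NODE, stale, current);
        } catch (RecipientLeftException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With a rule of this shape, the ScaleCube metadata request and the SQL request from the logs would both go through, because neither sender asked for a particular incarnation, while a send addressed to a stale ClusterNode would still be rejected.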