[ 
https://issues.apache.org/jira/browse/IGNITE-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-25805:
---------------------------------------
    Description: 
When sending a message, it might be necessary to establish a connection first. 
After establishing the connection, we check that the recipient ClusterNode's ID 
matches the ID of the node to which we actually established the connection. If 
they differ, we throw a RecipientLeftException.

This makes sense when the caller of MessagingService identifies the recipient 
by a ClusterNode: the check ensures that the message reaches the same 
'incarnation' of the node that the caller expects.

However, if the user of MessagingService doesn't care about the specific 
incarnation (that is, they send by consistent ID or by network address), the 
check should not be made at all.
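
To illustrate the intended rule, here is a minimal sketch of a check that is 
applied only when the recipient was identified by a ClusterNode. All names in 
it (TargetIdCheckSketch, AddressingMode, RecipientMismatchException, 
verifyRecipient) are hypothetical and are not the actual Ignite API; the two 
node IDs are taken from the logs in this ticket.

```java
import java.util.UUID;

/** Hypothetical sketch of the relaxed check; not the actual Ignite code. */
public final class TargetIdCheckSketch {
    /** How the caller identified the recipient (assumed classification). */
    enum AddressingMode { BY_CLUSTER_NODE, BY_CONSISTENT_ID, BY_NETWORK_ADDRESS }

    /** Stand-in for RecipientLeftException, to keep the sketch self-contained. */
    static final class RecipientMismatchException extends RuntimeException {
        RecipientMismatchException(UUID expected, UUID actual) {
            super("Target node ID is " + expected + ", but " + actual + " responded");
        }
    }

    /**
     * Compares node IDs only if the caller addressed a specific incarnation.
     * For sends by consistent ID or by network address the check is skipped.
     */
    static void verifyRecipient(AddressingMode mode, UUID expectedId, UUID actualId) {
        if (mode != AddressingMode.BY_CLUSTER_NODE) {
            return; // The caller does not care which incarnation answers.
        }
        if (!expectedId.equals(actualId)) {
            throw new RecipientMismatchException(expectedId, actualId);
        }
    }

    public static void main(String[] args) {
        UUID stale = UUID.fromString("446f2ec2-38f3-4636-9cd9-68626b6154cc");
        UUID restarted = UUID.fromString("1ed5b18f-bdce-4afc-8a00-35af4aaa8683");

        // Sends by network address or consistent ID pass despite the new ID.
        verifyRecipient(AddressingMode.BY_NETWORK_ADDRESS, stale, restarted);
        verifyRecipient(AddressingMode.BY_CONSISTENT_ID, stale, restarted);

        // A send by ClusterNode is still rejected when the incarnation changed.
        try {
            verifyRecipient(AddressingMode.BY_CLUSTER_NODE, stale, restarted);
        } catch (RecipientMismatchException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With a rule of this shape, sends by network address or by consistent ID 
would go through to the restarted node, while sends by ClusterNode would keep 
the incarnation guarantee.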

The effect can be severe. For example, ScaleCube sends its internal messages 
by network address. The following log excerpt shows that the physical topology 
on the node cannot be updated: getMetadata() makes a request to the new 
(reappeared) node, but because that node's ID differs from the one stored in 
the topology, the request fails. This in turn fails the topology update, 
potentially freezing the topology indefinitely:
{noformat}
2025-06-30 15:14:52:230 +0000 [WARNING][sc-cluster-3344-1][MetadataStore] 
[default:RollingUpgradeTest_cluster_0:736eff04a39940e3@172.30.0.2:3344][25e0caca-ff93-499e-b8e5-93a6b0f9690c]
 Timeout getting GetMetadataResp from 172.30.0.3:3345 within 3000 ms, cause: 
org.apache.ignite.internal.network.RecipientLeftException: IGN-NETWORK-5 Target 
node ID is 446f2ec2-38f3-4636-9cd9-68626b6154cc, but 
1ed5b18f-bdce-4afc-8a00-35af4aaa8683 responded TraceId:844e01fa
2025-06-30 15:14:52:230 +0000 [WARNING][sc-cluster-3344-1][MembershipProtocol] 
[default:Test_cluster_0:736eff04a39940e3@172.30.0.2:3344][updateMembership][SYNC]
 Skipping to add/update member: {m: 
default:Test_cluster_1:1ed5b18fbdce4afc@172.30.0.3:3345, s: ALIVE, inc: 0}, due 
to failed fetchMetadata call (cause: 
org.apache.ignite.internal.network.RecipientLeftException: IGN-NETWORK-5 Target 
node ID is 446f2ec2-38f3-4636-9cd9-68626b6154cc, but 
1ed5b18f-bdce-4afc-8a00-35af4aaa8683 responded TraceId:844e01fa){noformat}
Another issue is that when the SQL engine sends a request by consistent ID 
but the local topology has not been updated yet (such updates happen 
asynchronously), we might reject the message send even though the recipient 
node is ready to handle the message:
{noformat}
2025-06-30 15:14:53:588 +0000 
[WARNING][RollingUpgradeTest_cluster_0-network-worker-1][ExecutionServiceImpl] 
Fragment closing processed with errors: 
[queryId=ffffffff-c195-43d5-0000-000000000003]
org.apache.ignite.internal.network.RecipientLeftException: IGN-NETWORK-5 Target 
node ID is 446f2ec2-38f3-4636-9cd9-68626b6154cc, but 
1ed5b18f-bdce-4afc-8a00-35af4aaa8683 responded TraceId:9c391022
    at 
org.apache.ignite.internal.network.DefaultMessagingService.lambda$sendViaNetwork$3(DefaultMessagingService.java:360)
 
{noformat}
 

> Target node ID check is too strict
> ----------------------------------
>
>                 Key: IGNITE-25805
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25805
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
