[ https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620305#comment-16620305 ]

Ryabov Dmitrii commented on IGNITE-9026:
----------------------------------------

PDS tests were unmuted 2 days ago (your branch is based on a 7-day-old master), so 
they should be OK now. Test {{testGetReadThrough}} had 2 failures 2 weeks ago. 
[~syssoftsol], please rebase your branch on the current master and rerun "Run All" 
on [TeamCity|https://ci.ignite.apache.org].

> Two levels of Peer class loading fails in CONTINUOUS mode
> ---------------------------------------------------------
>
>                 Key: IGNITE-9026
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9026
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: David Harvey
>            Assignee: David Harvey
>            Priority: Major
>         Attachments: master_1b3742f4d7_p2p_two_hops.patch
>
>
> We had a seemingly functional system in SHARED mode, where we have a custom 
> StreamReceiver that sometimes sends closures from the peer-class-loaded code to 
> other servers. However, we ended up running out of Metaspace because we had 
> > 6000 class loaders! We suspected a regression in this change 
> [https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c],
> so we switched to CONTINUOUS mode. We then started getting failures to 
> load some of the classes for the closures on the second server. Through 
> some testing and code inspection, there seem to be the following flaws 
> between GridDeploymentCommunication.sendResourceRequest() and its two callers.
> The callers iterate through all the participant nodes until they find an 
> online node that responds to the request (a timeout is treated as an offline 
> node), with either success or failure, and then the loop terminates. The 
> assumption is that all nodes are equally capable of providing the resource, 
> so if one fails, then the others would also fail.
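> The loop behaves roughly like the following simplified sketch (hypothetical 
> names, not the actual Ignite source; a null response models a timeout):
> {code:java}
> import java.util.List;
> import java.util.UUID;
> 
> class ResourceLookupSketch {
>     /** Hypothetical per-node request; null models a timeout (node treated as offline). */
>     interface Requester {
>         Boolean request(UUID nodeId, String rsrcName);
>     }
> 
>     /**
>      * Mirrors how the callers of sendResourceRequest() iterate: stop on the FIRST
>      * definitive answer, success or failure, on the assumption that every
>      * participant is equally able to serve the resource.
>      */
>     static boolean loadResource(List<UUID> participants, String rsrcName, Requester req) {
>         for (UUID nodeId : participants) {
>             Boolean res = req.request(nodeId, rsrcName);
> 
>             if (res == null)
>                 continue;   // Timeout: treat the node as offline and try the next one.
> 
>             return res;     // Any definitive response, success OR failure, ends the loop.
>         }
> 
>         return false;       // No participant answered.
>     }
> }
> {code}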
> The first flaw is that GridDeploymentCommunication.sendResourceRequest() has 
> a check for a cycle, i.e., whether the destination node is one of the nodes 
> that originated or forwarded this request, and in that case a failure 
> response is faked. However, that causes the caller's loop to terminate. So, 
> depending on the order of the nodes in the participant list, 
> sendResourceRequest() may fail before trying any nodes, because one of the 
> calling nodes appears in the list ahead of them. It should instead skip the 
> calling nodes and continue with the rest of the list.
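> A minimal sketch of the difference (again with hypothetical names, not the 
> real Ignite code): the current cycle check fakes a failure, while the 
> proposed change just skips the node and moves on.
> {code:java}
> import java.util.Collection;
> import java.util.List;
> import java.util.UUID;
> 
> class CycleCheckSketch {
>     /** Current behaviour: a destination already on the call path gets a FAKED
>         failure, which the caller treats as definitive and stops iterating. */
>     static Boolean requestCurrent(UUID dst, Collection<UUID> callPath) {
>         if (callPath.contains(dst))
>             return Boolean.FALSE;   // Faked failure -> caller stops trying other nodes.
> 
>         return sendOverNetwork(dst);
>     }
> 
>     /** Proposed behaviour: skip nodes that originated or forwarded the request,
>         so the remaining participants (e.g. the client) still get tried. */
>     static Boolean loadResourceFixed(List<UUID> participants, Collection<UUID> callPath) {
>         for (UUID dst : participants) {
>             if (callPath.contains(dst))
>                 continue;           // Skip instead of faking a failure.
> 
>             Boolean res = sendOverNetwork(dst);
> 
>             if (res != null)
>                 return res;         // First definitive answer wins.
>         }
> 
>         return null;                // Nobody reachable could answer.
>     }
> 
>     /** Stand-in for the real network request; null models a timeout. */
>     static Boolean sendOverNetwork(UUID dst) {
>         return null;
>     }
> }
> {code}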
> Example with 1 client node and 2 server nodes: C1 sends data to S1, which 
> forwards a closure to S2. C1 also sends to S2, which forwards to S1. So now 
> the node lists on S1 and S2 contain C1 and the other S node. If the order 
> of the node list on S1 is (S2, C1) and on S2 is (S1, C1), then when S1 tries to 
> load a class, it will try S2; S2 will then try S1, but a fake failure is 
> generated, causing S2 not to try more nodes (i.e., C1), and causing 
> S1 also not to try more nodes.
> The other flaw is the assumption that all participants have equal access to 
> the resource. Assume S1 knows about userVersion1 via S3 and S4, with S3 
> reached through C1 and S4 through C2. If C2 fails, then S4 is no longer capable 
> of getting back to a master, but S1 has no way of knowing that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
