[ https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611324#comment-16611324 ]

David Harvey commented on IGNITE-9026:
--------------------------------------

We hit a related issue that currently looks like a separate bug: when the 
server went back to the client for the peer-class-loading (p2p) request, it got 
a network timeout on that message, yet the client stayed in the grid.  The p2p 
request should not fail permanently on a network error that the node itself did 
not treat as fatal.  As it stands, every subsequent attempt to load the class 
failed.

> Two levels of Peer class loading fails in CONTINUOUS mode
> ---------------------------------------------------------
>
>                 Key: IGNITE-9026
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9026
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: David Harvey
>            Assignee: David Harvey
>            Priority: Major
>
> We had a seemingly functional system in SHARED deployment mode, where a custom 
> StreamReceiver sometimes sends closures from the peer-class-loaded code to 
> other servers.  However, we ended up running out of Metaspace, because we had 
> more than 6,000 class loaders!  We suspected a regression introduced by 
> [https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c],
> so we switched to CONTINUOUS mode.  We then started getting failures to load 
> some of the classes for the closures on the second server.  Through testing 
> and code inspection, the following flaws appear in the interaction between 
> GridDeploymentCommunication.sendResourceRequest() and its two callers.
> The callers iterate through all the participant nodes until they find an 
> online node that answers the request (a timeout is treated as an offline 
> node) with either success or failure, and then the loop terminates.  The 
> assumption is that all nodes are equally capable of providing the resource, 
> so if one fails, the others would fail as well.
> The first flaw is that GridDeploymentCommunication.sendResourceRequest() 
> checks for a cycle, i.e., whether the destination node is one of the nodes 
> that originated or forwarded this request, and in that case fakes a failure 
> response.  However, that faked failure causes the caller's loop to terminate.  
> So, depending on the order of the nodes in the participant list, the caller 
> may give up before trying any reachable node, simply because one of the 
> calling nodes appears early in the list.  sendResourceRequest() should 
> instead skip the calling nodes (see the control-flow sketch after the quoted 
> description below).
> Example with one client node and two server nodes:  C1 sends data to S1, which 
> forwards a closure to S2.  C1 also sends data to S2, which forwards to S1.  The 
> participant lists on S1 and S2 therefore contain C1 and the other server node.  
> If the order of the list on S1 is (S2, C1) and on S2 is (S1, C1), then when S1 
> tries to load a class, it asks S2; S2 in turn asks S1, gets a faked failure, 
> and therefore does not try any more nodes (i.e., C1); that failure in turn 
> stops S1 from trying any more nodes.
> The other flaw is the assumption that all participants have equal access to 
> the resource.  Assume S1 knows about userVersion1 via S3 and S4, where S3 
> learned it through C1 and S4 through C2.  If C2 fails, then S4 can no longer 
> reach a master for the class, but S1 has no way of knowing that.
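
For reference, the control flow at issue can be sketched as below.  This is a
minimal, simplified illustration, not the actual Ignite implementation: the
ResourceRequestSketch and Response types are hypothetical, and only the loop
and termination behaviour described above is modelled.  The "skipping" variant
treats a calling node like an offline one so the caller keeps iterating, which
is one way to realise "skip the calling nodes".

    import java.util.List;
    import java.util.UUID;

    /** Hypothetical sketch of the request flow, not the real Ignite classes. */
    public class ResourceRequestSketch {
        enum Response { SUCCESS, FAILURE, TIMEOUT }

        /** Nodes that originated or forwarded the current request (the "calling" nodes). */
        private final List<UUID> callChain;

        ResourceRequestSketch(List<UUID> callChain) {
            this.callChain = callChain;
        }

        /** Current behaviour: a destination on the call chain gets a faked FAILURE. */
        Response sendResourceRequest(UUID dstNode) {
            if (callChain.contains(dstNode))
                return Response.FAILURE;   // faked failure: the caller loop below stops here
            return doSend(dstNode);
        }

        /** Proposed behaviour: treat a call-chain node like an offline one, so the caller moves on. */
        Response sendResourceRequestSkipping(UUID dstNode) {
            if (callChain.contains(dstNode))
                return Response.TIMEOUT;   // skipped: caller continues with the next participant
            return doSend(dstNode);
        }

        /** Caller loop: only a TIMEOUT keeps iterating; SUCCESS or FAILURE ends the loop. */
        Response requestFromParticipants(List<UUID> participants) {
            for (UUID node : participants) {
                Response res = sendResourceRequest(node);   // swap in sendResourceRequestSkipping() for the fix
                if (res != Response.TIMEOUT)
                    return res;
            }
            return Response.FAILURE;   // no participant could provide the resource
        }

        /** Placeholder for the real network round trip. */
        private Response doSend(UUID dstNode) {
            return Response.SUCCESS;
        }
    }

With the participant lists from the example above ((S2, C1) on S1 and (S1, C1)
on S2), the current variant returns the faked FAILURE on the S2 -> S1 hop, so
neither server ever falls back to C1; the skipping variant would let S2
continue on to C1.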



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
