[
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721485#comment-16721485
]
Dmitriy Pavlov commented on IGNITE-9026:
----------------------------------------
[~akalashnikov], what can be our next steps? should we move the issue to Open?
> Two levels of Peer class loading fails in CONTINUOUS mode
> ---------------------------------------------------------
>
> Key: IGNITE-9026
> URL: https://issues.apache.org/jira/browse/IGNITE-9026
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.5
> Reporter: David Harvey
> Assignee: David Harvey
> Priority: Major
> Fix For: 2.8
>
> Attachments: master_1b3742f4d7_p2p_two_hops.patch
>
>
> We had an seemingly functional system in SHARED_MODE, where we have a custom
> StreamReceiver that sometimes sends closures on the peer class loaded code to
> other servers. However, we ended up running out of Metaspace, because we had
> > 6000 class loaders! We suspected a regression in this change
> [https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c],
> so we switched to CONTINUOUS mode. We then started getting failures to
> load some of the classes for the closures on the second server. Through
> some testing and code inspection, there seems to be the following flaws
> between GridDeploymentCommunication.sendResourceRequest and its two callers.
> The callers iterate though all the participant nodes until they find an
> online node that responds to the request (timeout is treated as offline
> node), with either success or failure, and then the loop terminates. The
> assumption is that all nodes are equally capable of providing the resource,
> so if one fails, then the others would also fail.
> The first flaw is that GridDeploymentCommunication.sendResourceRequest() has
> a check for a cycle, i.e., whether the destination node is one of the nodes
> that originated or forwarded this request, and in that case, a failure
> response is faked. However, that causes the caller's loop to terminate. So
> depending on the order of the nodes in the participant list,
> sendResourceRequest() may fail before trying any nodes because it has one of
> the calling nodes on this list. It should instead be skipping any of the
> calling nodes.
> Example with 1 client node a 2 server nodes: C1 sends data to S1, which
> forwards closure to S2. C1 also sends to S2 which forwards to S1. So now
> the node lists on S1 and S2 contain C1 and the other S node. If the order
> of the node lists on S1 is (S2,C1) and on S2 (S1,C1), then when S1 tries to
> load a class, it will try S2, then S2 will try S1, but will get a fake
> failure generated, causing S2 not to try more nodes (i.e., C1), and causing
> S1 also not to try more nodes.
> The other flaw is the assumption that all participants have equal access to
> the resource. Assume S1 knows about userVersion1 via S3 and S4, with S3
> though C1 and S4 through C2. If C2 fails, then S4 is not capable of getting
> back to a master, but S1 has no way of knowing that.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)