[ https://issues.apache.org/jira/browse/IGNITE-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612412#comment-16612412 ]
Nicholas DiPiazza commented on IGNITE-3616:
-------------------------------------------
Yep, this one's killing us in production. Is anyone working on this yet?
> Drop failed nodes from topology after a configured timeout
> ----------------------------------------------------------
>
> Key: IGNITE-3616
> URL: https://issues.apache.org/jira/browse/IGNITE-3616
> Project: Ignite
> Issue Type: Improvement
> Components: cache
> Affects Versions: 1.5.0.final
> Reporter: Alexey Goncharuk
> Priority: Major
>
> If an OOME or assertion error occurs on a node, partition exchange often
> gets stuck, blocking the whole cluster. We should provide a mechanism
> to drop non-responsive nodes automatically.
> When partition exchange times out, the coordinator should:
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
> disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
> message is printed
> - when the timeout expires, we should kill the nodes and automatically
> collect their thread dumps (on a best-effort basis)
> - we should print out a message asking users to provide these thread dumps to
> us via Jira or dev list
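
For illustration only, the timeout rules in the description above could be sketched roughly as follows. This is not Ignite code: the class ExchangeWatchdog, its methods, and the constants are hypothetical names, and the sketch assumes that "at least a minute" means the configured kill timeout must be 60 seconds or more, counted from the first non-responsive warning for a node. Killing the node and collecting thread dumps are left out; the method only reports when the kill timeout has expired.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed kill timeout; not part of the Ignite API. */
final class ExchangeWatchdog {
    /** Value that disables automatic kills, as proposed in the ticket. */
    static final long DISABLED = -1L;

    /** Assumed lower bound: one minute after the first warning. */
    static final long MIN_KILL_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    /** Node ID -> time (ms) at which the first non-responsive warning was printed. */
    private final Map<String, Long> firstWarnedAt = new HashMap<>();

    ExchangeWatchdog(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 (disabled) or at least " + MIN_KILL_TIMEOUT_MS + " ms");
        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * Called by the coordinator each time a node is found non-responsive.
     * Always prints the node ID; returns true once the kill timeout has expired.
     */
    boolean reportNonResponsive(String nodeId, long nowMs) {
        // Remember when this node was first reported; later reports keep that time.
        long first = firstWarnedAt.computeIfAbsent(nodeId, id -> nowMs);

        // The warning is printed regardless of whether kills are enabled.
        System.out.println("Non-responsive node: " + nodeId);

        if (killTimeoutMs == DISABLED)
            return false;

        return nowMs - first >= killTimeoutMs;
    }

    /** Forgets a node that has responded again, so its timer restarts next time. */
    void markResponsive(String nodeId) {
        firstWarnedAt.remove(nodeId);
    }
}
```

A coordinator would call reportNonResponsive on every exchange timeout check and, when it returns true, attempt the thread dump before stopping the node. Passing the clock value in as a parameter keeps the sketch easy to test without real waiting.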
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)