[ https://issues.apache.org/jira/browse/IGNITE-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612133#comment-16612133 ]
Kevin Cowan commented on IGNITE-3616:
-------------------------------------
This is actually more of a bug than an improvement, and it is effectively a
blocker for anyone running Ignite in a mission-critical layer of their
application. When this event occurs, the only known resolution is to restart.
If there is another resolution, please advise. We have numerous clients whose
systems become unusable and who must restart their production systems because
of this issue.
> Drop failed nodes from topology after a configured timeout
> ----------------------------------------------------------
>
> Key: IGNITE-3616
> URL: https://issues.apache.org/jira/browse/IGNITE-3616
> Project: Ignite
> Issue Type: Improvement
> Components: cache
> Affects Versions: 1.5.0.final
> Reporter: Alexey Goncharuk
> Priority: Major
>
> If an OOME or assertion error occurs on a node, it is not uncommon for
> partition exchange to get stuck, blocking the whole cluster. We should
> provide a mechanism to drop non-responsive nodes automatically.
> When partition exchange times out, the coordinator should:
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
> disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
> message is printed
> - when the timeout expires, we should kill the nodes and automatically
> collect their thread dumps (do best effort for it)
> - we should print out a message asking users to provide these thread dumps to
> us via Jira or dev list
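
For illustration only, the timeout bookkeeping proposed in the list above
could be sketched roughly as follows. This is not Ignite code: the class
NonResponsiveNodeTracker, its methods, and the use of plain string node IDs
are all hypothetical, and the actual kill and thread-dump steps are left out.
It only shows the "-1 means disabled" rule, the one-minute minimum, and
measuring the timeout from the first warning printed for a node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Ignite API. */
public class NonResponsiveNodeTracker {
    /** Value that disables the kill timeout, as suggested in the issue. */
    public static final long DISABLED = -1L;

    /** Minimum delay suggested in the issue: one minute after the first warning. */
    public static final long MIN_KILL_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    /** Node ID mapped to the time the first non-responsive warning was printed. */
    private final Map<String, Long> firstWarningMs = new HashMap<>();

    public NonResponsiveNodeTracker(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException("Kill timeout must be -1 or at least one minute");
        this.killTimeoutMs = killTimeoutMs;
    }

    /** Records a warning for a node; returns true if it is the first one for that node. */
    public boolean reportNonResponsive(String nodeId, long nowMs) {
        return firstWarningMs.putIfAbsent(nodeId, nowMs) == null;
    }

    /** Forgets a node that has responded again. */
    public void markResponsive(String nodeId) {
        firstWarningMs.remove(nodeId);
    }

    /** Returns the nodes whose kill timeout has expired; always empty when disabled. */
    public List<String> nodesToKill(long nowMs) {
        List<String> expired = new ArrayList<>();
        if (killTimeoutMs == DISABLED)
            return expired;
        for (Map.Entry<String, Long> e : firstWarningMs.entrySet()) {
            if (nowMs - e.getValue() >= killTimeoutMs)
                expired.add(e.getKey());
        }
        return expired;
    }
}
```

In this sketch the coordinator would call reportNonResponsive each time it
prints the warning and poll nodesToKill on a timer. Passing timestamps in
explicitly, rather than reading the clock inside the class, is just a choice
that keeps the logic easy to test.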
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)