[ 
https://issues.apache.org/jira/browse/IGNITE-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612412#comment-16612412
 ] 

Nicholas DiPiazza commented on IGNITE-3616:
-------------------------------------------

Yep, this one's killing us in production. Is anyone working on this yet?

> Drop failed nodes from topology after a configured timeout
> ----------------------------------------------------------
>
>                 Key: IGNITE-3616
>                 URL: https://issues.apache.org/jira/browse/IGNITE-3616
>             Project: Ignite
>          Issue Type: Improvement
>          Components: cache
>    Affects Versions: 1.5.0.final
>            Reporter: Alexey Goncharuk
>            Priority: Major
>
> If an OOME or assertion error occurs on a node, partition exchange often gets
> stuck, blocking the whole cluster. We should provide a mechanism to drop
> non-responsive nodes automatically.
> When partition exchange times out, the coordinator should:
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
> disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
> message is printed
> - when the timeout expires, we should kill the nodes and automatically
> collect their thread dumps (on a best-effort basis)
> - we should print out a message asking users to provide these thread dumps to 
> us via Jira or dev list
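
For illustration only, the timeout rules in the description above could be sketched roughly as follows. This is not Ignite code and uses no Ignite API: the class UnresponsiveNodeTracker, its methods, and the millisecond-based clock are all hypothetical names invented for this sketch. It assumes the kill timeout is measured from the first "non-responsive" report for a node, that -1 disables killing, and that any enabled timeout must be at least one minute. Actually stopping a node and collecting its thread dump are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for the proposal above; not part of the Ignite API. */
class UnresponsiveNodeTracker {
    /** A kill timeout of -1 means automatic killing is disabled. */
    static final long DISABLED = -1L;

    /** Assumed lower bound: one minute after the first non-responsive message. */
    static final long MIN_KILL_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    /** Node ID -> time (ms) at which the node was first reported as non-responsive. */
    private final Map<String, Long> firstReportedMs = new LinkedHashMap<>();

    UnresponsiveNodeTracker(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException("Kill timeout must be -1 or >= " + MIN_KILL_TIMEOUT_MS + " ms");

        this.killTimeoutMs = killTimeoutMs;
    }

    /** Records a non-responsive report; only the first report per node sets the start time. */
    void reportUnresponsive(String nodeId, long nowMs) {
        firstReportedMs.putIfAbsent(nodeId, nowMs);
    }

    /** Clears a node that has responded again, so it is no longer a kill candidate. */
    void markResponsive(String nodeId) {
        firstReportedMs.remove(nodeId);
    }

    /** Returns the nodes whose kill timeout has expired; always empty when disabled. */
    List<String> nodesToKill(long nowMs) {
        List<String> expired = new ArrayList<>();

        if (killTimeoutMs == DISABLED)
            return expired;

        for (Map.Entry<String, Long> e : firstReportedMs.entrySet()) {
            if (nowMs - e.getValue() >= killTimeoutMs)
                expired.add(e.getKey());
        }

        return expired;
    }
}
```

In this sketch the coordinator would call reportUnresponsive each time it prints a non-responsive node message and poll nodesToKill on each exchange timeout check. Passing the clock value in explicitly keeps the logic easy to test without waiting for real time to pass.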



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
