[ 
https://issues.apache.org/jira/browse/IGNITE-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326351#comment-16326351
 ] 

Vitaliy Biryukov commented on IGNITE-6953:
------------------------------------------

Hi, [~dmagda].
I've investigated this issue and found a scenario that leads to a hang:
# The client node performs some transactions.
# An OOME occurs in a system thread that is required to process transactions.
# The thread dies and some memory is released.
# The node continues to perform some operations, including pings.
# A partition map exchange starts and waits for all transactions to complete.
# All transactions on the entire grid wait for the partition map exchange to 
complete.
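
Steps 2-5 can be illustrated outside of Ignite with a minimal sketch. It is not the attached reproducer and uses no Ignite classes; DeadWorkerDemo, runScenario and the latch are hypothetical names, and the OOME is simulated by throwing it directly. It only shows that an Error kills one thread while the JVM keeps running, so anything waiting for that thread's work never gets an answer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class; not part of Ignite or of the attached test. */
final class DeadWorkerDemo {
    /**
     * Starts a worker that is supposed to finish a "transaction".
     * Returns true if the waiter saw the transaction finish within the timeout.
     */
    static boolean runScenario(boolean workerFails, long timeoutMs) {
        CountDownLatch txFinished = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            if (workerFails)
                throw new OutOfMemoryError("simulated"); // Thread dies, latch is never released.

            txFinished.countDown();
        }, "sys-worker");

        // Swallow the error so the demo output stays clean; the JVM itself keeps running.
        worker.setUncaughtExceptionHandler((t, e) -> { });

        try {
            worker.start();
            worker.join();

            // Stands in for the exchange waiting for transactions (bounded here by a timeout).
            return txFinished.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy worker finished: " + runScenario(false, 200));
        System.out.println("failed worker finished: " + runScenario(true, 200));
    }
}
```

In the real scenario the wait has no timeout, which is why the hang spreads to the whole grid.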
 
I've attached logs of some scenarios with real OOMEs and a simple reproducer in 
the test (it just throws an OOME in the right place).

I think it's better to avoid such situations: instead of rethrowing errors (as 
in the code below), change some node state and try to stop the node. The node 
should also send its state in pings. If a received ping carries a bad state, 
then throw the node out of the topology.
{code:java}
catch (Throwable e) {
    .
    .
    .
    if (e instanceof Error)
        throw e;
}
{code}
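
A rough sketch of what I mean, under loud assumptions: NodeState, GuardedNode and Topology are hypothetical names invented for illustration and are not existing Ignite classes, and the ping is reduced to a direct method call. It only shows the idea of recording the failure instead of rethrowing it, and of dropping a node whose ping reports a bad state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical health states; not part of the Ignite API. */
enum NodeState { OK, FAILED }

/** Hypothetical node-side wrapper around a critical system task. */
final class GuardedNode {
    private final AtomicReference<NodeState> state = new AtomicReference<>(NodeState.OK);

    /** Runs a critical task; on an Error, records the failure instead of rethrowing it. */
    void runCritical(Runnable task) {
        try {
            task.run();
        }
        catch (Throwable e) {
            // Handling of ordinary exceptions is elided, as in the snippet above.
            if (e instanceof Error) {
                state.set(NodeState.FAILED);
                // A real node would also start stopping itself here.
            }
        }
    }

    /** State that would be attached to every ping response. */
    NodeState pingState() {
        return state.get();
    }
}

/** Hypothetical topology view kept by the node that sends pings. */
final class Topology {
    private final Map<String, GuardedNode> nodes = new ConcurrentHashMap<>();

    void join(String id, GuardedNode node) {
        nodes.put(id, node);
    }

    /** "Pings" every node and removes those that report a bad state. */
    void pingAll() {
        nodes.entrySet().removeIf(e -> e.getValue().pingState() != NodeState.OK);
    }

    boolean contains(String id) {
        return nodes.containsKey(id);
    }
}
```

With this shape, a client that hit an OOME in runCritical is removed on the next pingAll, so the partition map exchange no longer has to wait for it.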
Any thoughts?

> OOM on the client node freezes the whole cluster
> ------------------------------------------------
>
>                 Key: IGNITE-6953
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6953
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Magda
>            Assignee: Vitaliy Biryukov
>            Priority: Major
>              Labels: iep-5
>             Fix For: 2.4
>
>         Attachments: OOMETest.java, ignite-0590a557.log, ignite-5df99d7b.log, 
> ignite-9b5b6e6e.log
>
>
> It's reported that if an OOM happens on the client side the whole cluster 
> becomes unresponsive:
> http://apache-ignite-users.70518.x6.nabble.com/Out-of-memory-in-client-node-freezes-complete-cluster-td18044.html
> Let's reproduce and prevent this by bringing the client node down 
> automatically.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
