[ 
https://issues.apache.org/jira/browse/IGNITE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579022#comment-14579022
 ] 

Semen Boikov commented on IGNITE-1003:
--------------------------------------

Did some testing with one server and one client, and found one suspicious
place in the server thread dump taken at the moment when the client complains
about an exchange timeout:
{noformat}
"grid-nio-worker-0-#67%null%" prio=10 tid=0x00007ff3888ce800 nid=0x1824 runnable [0x00007ff30dfbd000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        - locked <0x00000000ed988a28> (a java.net.SocksSocketImpl)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.openSocket(TcpDiscoverySpi.java:1097)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:541)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:470)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:433)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.pingNode(TcpDiscoverySpi.java:346)
        at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.tryFailNode(GridDiscoveryManager.java:1459)
        at org.apache.ignite.internal.managers.GridManagerAdapter$1.tryFailNode(GridManagerAdapter.java:484)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2.onDisconnected(TcpCommunicationSpi.java:256)
        at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onExceptionCaught(GridNioFilterChain.java:253)
        at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
        at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onExceptionCaught(GridNioCodecFilter.java:74)
        at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
        at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onExceptionCaught(GridConnectionBytesVerifyFilter.java:65)
        at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
        at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onExceptionCaught(GridNioServer.java:1985)
        at org.apache.ignite.internal.util.nio.GridNioFilterChain.onExceptionCaught(GridNioFilterChain.java:157)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:1521)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1346)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1275)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1159)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:108)
        at java.lang.Thread.run(Thread.java:722)
{noformat}

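Here the nio worker hangs in tryFailNode(): the call goes through pingNode()
and sits in a blocking socket connect, so communication IO handled by that
worker is blocked. We need to move the tryFailNode() call off the nio worker
thread.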
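
One possible shape for such a change is sketched below. It is only an
illustration of handing a blocking check to a separate thread, not the actual
Ignite fix: the class OffloadFailureCheck, the BlockingCheck interface and the
"failure-check-worker" thread name are all hypothetical and do not exist in
the Ignite code base.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: none of these names come from the Ignite code base. */
public class OffloadFailureCheck {
    /** Stand-in for a blocking check, e.g. a ping that opens a socket. */
    public interface BlockingCheck {
        void run() throws Exception;
    }

    /** Dedicated thread for blocking checks, so the I/O thread never runs them. */
    private final ExecutorService checkExecutor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "failure-check-worker");
        t.setDaemon(true);
        return t;
    });

    /** Called from the I/O thread; only enqueues the check and returns. */
    public void onDisconnected(BlockingCheck check) {
        checkExecutor.execute(() -> {
            try {
                check.run();
            } catch (Exception e) {
                System.err.println("Failure check threw: " + e);
            }
        });
    }

    public void shutdown() {
        checkExecutor.shutdownNow();
    }

    /** Returns true if a slow check completed on a thread other than the caller's. */
    public static boolean demoRunsOffCallerThread() {
        OffloadFailureCheck handler = new OffloadFailureCheck();
        AtomicReference<Thread> checkThread = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        try {
            handler.onDisconnected(() -> {
                Thread.sleep(200); // simulate a slow connect attempt
                checkThread.set(Thread.currentThread());
                done.countDown();
            });
            boolean finished = done.await(5, TimeUnit.SECONDS);
            return finished && checkThread.get() != Thread.currentThread();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            handler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Check ran off the caller thread: " + demoRunsOffCallerThread());
    }
}
```

With this shape the thread that detects the disconnect only pays for an
enqueue. A single-threaded executor keeps the checks ordered, but one slow
ping still delays the checks queued behind it, so a real change would need to
decide on pool size and timeouts.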

> Communication issues when running client node in separate subnetwork
> --------------------------------------------------------------------
>
>                 Key: IGNITE-1003
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1003
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: sprint-4
>            Reporter: Valentin Kulichenko
>            Priority: Blocker
>             Fix For: sprint-5
>
>         Attachments: client.zip, server.zip, test.xml
>
>
> Test is the following:
> * Run 8 server nodes on one box.
> * Start and stop a client node in a loop on a different box in a different 
> subnetwork (e.g., over VPN).
> On one of the iterations the node join process will hang for several minutes 
> due to timeouts in the initial partition exchange. At some point 
> communication between some of the server nodes stops working: messages wait 
> in the queue until the connection is closed and these messages are recovered.
> Attached are the configuration file used to run the test and logs with 
> communication debug enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
