[ https://issues.apache.org/jira/browse/IGNITE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579022#comment-14579022 ]
Semen Boikov commented on IGNITE-1003:
--------------------------------------

Did some testing with one server and one client, and found one suspicious place in the server thread dump taken at the moment when the client complains about the exchange timeout:

{noformat}
"grid-nio-worker-0-#67%null%" prio=10 tid=0x00007ff3888ce800 nid=0x1824 runnable [0x00007ff30dfbd000]
   java.lang.Thread.State: RUNNABLE
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	- locked <0x00000000ed988a28> (a java.net.SocksSocketImpl)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
	at java.net.Socket.connect(Socket.java:579)
	at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.openSocket(TcpDiscoverySpi.java:1097)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:541)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:470)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:433)
	at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.pingNode(TcpDiscoverySpi.java:346)
	at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.tryFailNode(GridDiscoveryManager.java:1459)
	at org.apache.ignite.internal.managers.GridManagerAdapter$1.tryFailNode(GridManagerAdapter.java:484)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2.onDisconnected(TcpCommunicationSpi.java:256)
	at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onExceptionCaught(GridNioFilterChain.java:253)
	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
	at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onExceptionCaught(GridNioCodecFilter.java:74)
	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
	at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onExceptionCaught(GridConnectionBytesVerifyFilter.java:65)
	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
	at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onExceptionCaught(GridNioServer.java:1985)
	at org.apache.ignite.internal.util.nio.GridNioFilterChain.onExceptionCaught(GridNioFilterChain.java:157)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:1521)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1346)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1275)
	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1159)
	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:108)
	at java.lang.Thread.run(Thread.java:722)
{noformat}

Here the NIO worker hangs in tryFailNode(): it is stuck in a blocking socket connect made by pingNode(), so communication I/O handled by this worker is blocked. The tryFailNode() call needs to be moved off the NIO worker thread.

> Communication issues when running client node in separate subnetwork
> --------------------------------------------------------------------
>
>                 Key: IGNITE-1003
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1003
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: sprint-4
>            Reporter: Valentin Kulichenko
>            Priority: Blocker
>             Fix For: sprint-5
>
>         Attachments: client.zip, server.zip, test.xml
>
>
> Test is the following:
> * Run 8 server nodes on one box.
> * Start and stop a client node in a loop on a different box in a different
> subnetwork (e.g., over VPN).
> On one of the iterations the node join process will hang for several minutes
> due to timeouts in the initial partition exchange. At some point communication
> between some of the server nodes stops working - messages wait in the queue
> until the connection is closed and these messages are recovered.
> Attached are the configuration file used to run the test and logs with
> communication debug enabled.
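
The fix proposed in the comment (moving tryFailNode() off the NIO worker) could look roughly like the sketch below. This is not the actual Ignite change. DisconnectHandler, failCheckPool and blockingFailCheck are hypothetical names invented for illustration; only the standard java.util.concurrent API is used. The idea is that the disconnect callback hands the potentially blocking check to a separate executor and returns immediately, so the I/O thread keeps serving other connections.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class OffloadFailCheck {
    /** Hypothetical stand-in for a disconnect listener invoked on an I/O worker thread. */
    static final class DisconnectHandler {
        private final ExecutorService failCheckPool;
        private final Consumer<String> blockingFailCheck;

        DisconnectHandler(ExecutorService failCheckPool, Consumer<String> blockingFailCheck) {
            this.failCheckPool = failCheckPool;
            this.blockingFailCheck = blockingFailCheck;
        }

        /** Called on the I/O thread; only schedules the check and returns right away. */
        void onDisconnected(String nodeId) {
            failCheckPool.execute(() -> blockingFailCheck.accept(nodeId));
        }
    }

    /** Returns true if the offloaded check completed without blocking the caller. */
    static boolean demo() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);

        // The check blocks until 'release' is counted down, simulating a slow connect.
        DisconnectHandler handler = new DisconnectHandler(pool, nodeId -> {
            try {
                release.await();
                done.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // If the check ran inline, this call would never return,
        // because 'release' is only counted down afterwards.
        handler.onDisconnected("node-1");

        release.countDown();
        boolean finished = done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("check completed off the caller thread: " + demo());
    }
}
```

A dedicated pool keeps slow connects from starving the I/O workers; whether a single thread is enough for the fail checks, and how they should be ordered, depends on the real code and is left open here.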