[ https://issues.apache.org/jira/browse/IGNITE-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149185#comment-16149185 ]
Ilya Kasnacheev commented on IGNITE-6071: ----------------------------------------- I don't understand what were the circumstances that led to the problem, however I have reproduced it as a wider issue. I've wrote a letter to dev@ titled "Hopeless looping in TcpCommunicationSpi" > Client may detect necessity for reconnect for too long > ------------------------------------------------------ > > Key: IGNITE-6071 > URL: https://issues.apache.org/jira/browse/IGNITE-6071 > Project: Ignite > Issue Type: Bug > Affects Versions: 2.1 > Reporter: Yakov Zhdanov > Assignee: Ilya Kasnacheev > > There was a GC pause on client that caused servers to drop client due to > inability to establish TCP communication connection. Then it took some time > for client to detect that it has been dropped. During that time client many > times attempted to connect to server which can be seen in the logs. After > client detected its drop and reconnected servers fired node added event and > no log flood can be found any more. > We need to find out why client was reconnecting via communication and did not > detect the drop for such a long time. > I hope this can be reproduced in test: > * start 2 servers > * start client > * suspend all client threads with Thread.suspend() - just filter threads of > current JVM by name and suspend ones belonging to the client. > {noformat} > [10:12:24,785][WARNING][disco-event-worker-#71%null%][GridDiscoveryManager] > Node FAILED: TcpDiscoveryNode [id=dd71479c-41ba-443e-b25c-3803a2a94f4f, > addrs=[10.44.3.14, 127.0.0.1], sockAddrs=[/127.0.0.1:0, > XXX.com/10.44.3.14:0], discPort=0, order=2, intOrder=2, > lastExchangeTime=1502269008673, loc=false, ver=2.1.1#20170618-sha1:09ce29e0, > isClient=true] > [10:12:24,785][INFO][disco-event-worker-#71%null%][GridDiscoveryManager] > Topology snapshot [ver=5, servers=2, clients=1, CPUs=144, heap=76.0GB] > [10:12:24,794][INFO][exchange-worker-#72%null%][time] Started exchange init > [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], crd=false, evt=12, > node=TcpDiscoveryNode [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, > addrs=[10.44.3.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, > XXX.com/10.44.3.11:47500], discPort=47500, order=3, intOrder=3, > lastExchangeTime=1502269944782, loc=true, ver=2.1.1#20170618-sha1:09ce29e0, > isClient=false], evtNode=TcpDiscoveryNode > [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, addrs=[10.44.3.11, 127.0.0.1], > sockAddrs=[/127.0.0.1:47500, XXX.com/10.44.3.11:47500], discPort=47500, > order=3, intOrder=3, lastExchangeTime=1502269944782, loc=true, > ver=2.1.1#20170618-sha1:09ce29e0, isClient=false], customEvt=null] > [10:12:24,813][INFO][exchange-worker-#72%null%][time] Finished exchange init > [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], crd=false] > [10:12:24,819][INFO][exchange-worker-#72%null%][GridCachePartitionExchangeManager] > Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion > [topVer=5, minorTopVer=0], evt=NODE_FAILED, > node=dd71479c-41ba-443e-b25c-3803a2a94f4f] > [10:12:28,344][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52474] > [10:12:28,348][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52482] > [10:12:28,356][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52506] > [10:12:28,362][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52522] > [10:12:28,368][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52538] > [10:12:28,374][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52554] > [10:12:28,380][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52570] > [10:12:28,386][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52586] > [10:12:28,392][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52602] > [10:12:28,397][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52618] > [10:12:28,402][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52634] > [10:12:28,407][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52650] > [10:12:28,412][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:52666] > ... > [10:18:32,684][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:43604] > [10:18:32,690][INFO][grid-nio-worker-tcp-comm-1-#58%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:43620] > [10:18:32,695][INFO][grid-nio-worker-tcp-comm-0-#57%null%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/10.44.3.11:47100, > rmtAddr=/10.44.3.14:43636] > [10:18:42,831][INFO][disco-event-worker-#71%null%][GridDiscoveryManager] > Added new node to topology: TcpDiscoveryNode > [id=2e80b0f0-21db-451d-a264-34ba16e00ffa, addrs=[10.44.3.14, 127.0.0.1], > sockAddrs=[/127.0.0.1:0, > gbrdsr000002837.intranet.barcapint.com/10.44.3.14:0], discPort=0, order=6, > intOrder=5, lastExchangeTime=1502270322805, loc=false, > ver=2.1.1#20170618-sha1:09ce29e0, isClient=true] > [10:18:42,832][INFO][disco-event-worker-#71%null%][GridDiscoveryManager] > Topology snapshot [ver=6, servers=2, clients=2, CPUs=144, heap=90.0GB] > [10:18:42,833][INFO][exchange-worker-#72%null%][time] Started exchange init > [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=false, evt=10, > node=TcpDiscoveryNode [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, > addrs=[10.44.3.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, > XXX.com/10.44.3.11:47500], discPort=47500, order=3, intOrder=3, > lastExchangeTime=1502270322815, loc=true, ver=2.1.1#20170618-sha1:09ce29e0, > isClient=false], evtNode=TcpDiscoveryNode > [id=98c1fdf7-09db-4fa0-bb01-8ca7f046643d, addrs=[10.44.3.11, 127.0.0.1], > sockAddrs=[/127.0.0.1:47500, XXX.com/10.44.3.11:47500], discPort=47500, > order=3, intOrder=3, lastExchangeTime=1502270322815, loc=true, > ver=2.1.1#20170618-sha1:09ce29e0, isClient=false], customEvt=null] > [10:18:42,851][INFO][exchange-worker-#72%null%][time] Finished exchange init > [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=false] > [10:18:42,855][INFO][exchange-worker-#72%null%][GridCachePartitionExchangeManager] > Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion > [topVer=6, minorTopVer=0], evt=NODE_JOINED, > node=2e80b0f0-21db-451d-a264-34ba16e00ffa] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)