[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545297#comment-16545297 ]
Anton Vinogradov commented on IGNITE-8783: ------------------------------------------ [~ilantukh], 4 problems related to ExchangeLatch hang found: 1) pendingAcks was ignored at client latch recreation on coordinator change. +Fixed+. {noformat} // There is final ack for created latch. if (pendingAcks.containsKey(latchId)) { {noformat} was replaced with {noformat} Set<UUID> nodeIds = pendingAcks.get(latchId); // There is final ack for created latch. if (nodeIds != null && nodeIds.contains(coordinator)) { {noformat} 2) Topology change could cause coordinator change even in case coordinator node not failed. +Fixed+. added sorting by order to {{getLatchCoordinator}} {noformat} .sorted(Comparator.comparing(ClusterNode::order)) {noformat} Now coordinator is alwais oldest node. 3) Sometimes Latch fails on message send in case connection was not established yet. {noformat} [2018-07-13 19:06:43,910][ERROR][exchange-worker-#233015%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest4%][TcpCommunicationSpi] Failed to send message to remote node [node=TcpDiscoveryNode [id=3838f6ed-1b4d-484d-9773-df4493700000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1531498003891, loc=false, ver=2.6.0#19700101-sha1:00000000, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_EXCHANGE, topicOrd=31, ordered=false, timeout=0, skipOnTimeout=false, msg=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.LatchAckMessage@77e7c81f]] class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=3838f6ed-1b4d-484d-9773-df4493700000, addrs=[/127.0.0.1:45010]] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3449) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2977) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2860) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2703) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2662) at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1643) at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1715) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.sendAck(ExchangeLatchManager.java:624) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.countDown(ExchangeLatchManager.java:642) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.waitPartitionRelease(GridDhtPartitionsExchangeFuture.java:1406) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1177) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:732) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2477) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2357) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) at java.lang.Thread.run(Thread.java:748) Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/127.0.0.1:45010, err=Address already in use: no further information] at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3452) ... 15 more Caused by: java.net.BindException: Address already in use: no further information at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:111) at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3289) ... 15 more [2018-07-13 19:06:43,911][INFO ][exchange-worker-#233015%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest4%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], waitTime=0ms, futInfo=NA] [2018-07-13 19:06:43,911][ERROR][exchange-worker-#233015%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest4%][ExchangeLatchManager] Failed to send ack [latch=exchange-AffinityTopologyVersion [topVer=4, minorTopVer=0], to=3838f6ed-1b4d-484d-9773-df4493700000]: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=3838f6ed-1b4d-484d-9773-df4493700000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1531498003891, loc=false, ver=2.6.0#19700101-sha1:00000000, isClient=false], topic=TOPIC_EXCHANGE, msg=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.LatchAckMessage@77e7c81f, policy=2] [2018-07-13 19:06:43,913][INFO ][grid-nio-worker-tcp-comm-2-#232820%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest0%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:62577] [2018-07-13 19:06:43,913][INFO ][grid-nio-worker-tcp-comm-0-#232990%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest4%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:62577, rmtAddr=/127.0.0.1:45010] {noformat} Seems, we have some race/bug at TcpCommunicationSpi? 4) {{GridCachePartitionedDataStructuresFailoverSelfTest#testReentrantLockConstantTopologyChangeNonFailoverSafe}} can hang in case of broken tx {noformat} Pending transactions: [2018-07-15 14:13:41,210][WARN ][exchange-worker-#1596354%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest1%][diagnostic] >>> [txVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], exchWait=true, tx=GridDhtTxLocal [nearNodeId=1392b1bd-c807-4479-9bfe-fc9f70500000, nearFutId=14ffca0a461-999e75d0-a333-4bd6-a2a2-7f143d0af773, nearMiniId=1, nearFinFutId=null, nearFinMiniId=0, nearXidVer=GridCacheVersion [topVer=143133203, order=1531653200153, nodeOrder=1], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=false, nearNodes=[], dhtNodes=[], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false, depEnabled=false, txState=IgniteTxStateImpl [activeCacheIds=[1968300681], recovery=false, txMap=[IgniteTxEntry [key=KeyCacheObjectImpl [part=494, val=GridCacheInternalKeyImpl [name=structure, grpName=default-volatile-ds-group], hasValBytes=true], cacheId=1968300681, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=494, val=GridCacheInternalKeyImpl [name=structure, grpName=default-volatile-ds-group], hasValBytes=true], cacheId=1968300681], val=[op=NOOP, val=null], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=494, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=494, val=GridCacheInternalKeyImpl [name=structure, grpName=default-volatile-ds-group], hasValBytes=true], val=CacheObjectImpl [val=null, hasValBytes=true], ver=GridCacheVersion [topVer=143133201, order=1531653200154, nodeOrder=2], hash=2095426867, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=[GridCacheMvccCandidate [nodeId=1bf28b00-feed-412b-a20b-ca9fc1100001, ver=GridCacheVersion [topVer=143133203, order=1531653200157, nodeOrder=2], threadId=1947290, id=31143709, topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], reentry=null, otherNodeId=1392b1bd-c807-4479-9bfe-fc9f70500000, otherVer=GridCacheVersion [topVer=143133203, order=1531653200153, nodeOrder=1], mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=494, val=GridCacheInternalKeyImpl [name=structure, grpName=default-volatile-ds-group], hasValBytes=true], masks=local=1|owner=1|ready=1|reentry=0|used=0|tx=1|single_implicit=0|dht_local=1|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]], rmts=null]], flags=2]]], prepared=0, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [topVer=143133203, order=1531653200157, nodeOrder=2]]]], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=143133203, order=1531653200157, nodeOrder=2], writeVer=null, implicit=false, loc=true, threadId=1947290, startTime=1531653200578, nodeId=1bf28b00-feed-412b-a20b-ca9fc1100001, startVer=GridCacheVersion [topVer=143133203, order=1531653200157, nodeOrder=2], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, sysInvalidate=false, sys=true, plc=2, commitVer=null, finalizing=NONE, invalidParts=null, state=ACTIVE, timedOut=false, topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], duration=20632ms, onePhaseCommit=false], size=1]]]] {noformat} __ So, I propose to merge #1 & #2 (this solves 95% hangs), and create issues for #3 & #4. TC checked: - [https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_DataStructures&tab=buildTypeStatusDiv&branch_IgniteTests24Java8=ignite-8783] - [https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&branch_IgniteTests24Java8=ignite-8783] > Failover tests periodically cause hanging of the whole Data Structures suite > on TC > ---------------------------------------------------------------------------------- > > Key: IGNITE-8783 > URL: https://issues.apache.org/jira/browse/IGNITE-8783 > Project: Ignite > Issue Type: Bug > Components: data structures > Reporter: Ivan Rakov > Assignee: Anton Vinogradov > Priority: Major > Labels: MakeTeamcityGreenAgain > > History of suite runs: > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_DataStructures&tab=buildTypeHistoryList&branch_IgniteTests24Java8=%3Cdefault%3E > Chance of suite hang is 18% in master (based on previous 50 runs). > Hang is always caused by one of the following failover tests: > {noformat} > GridCacheReplicatedDataStructuresFailoverSelfTest#testAtomicSequenceConstantTopologyChange > GridCachePartitionedDataStructuresFailoverSelfTest#testFairReentrantLockConstantTopologyChangeNonFailoverSafe > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)