[ https://issues.apache.org/jira/browse/IGNITE-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367118#comment-16367118 ]
Pavel Kovalenko commented on IGNITE-7717:
-----------------------------------------

Cause of the problem: suppose we have an existing topology and start several nodes in parallel.

1) Several pending exchange futures (NODE_JOIN) sit in the queue. Each exchange future has its own discovery cache, and each discovery cache reflects a different state of the alive nodes and related metadata.
2) We pick and process the first (earliest) future, which carries the earliest state of the discovery cache.
3) We update the topology discovery caches (the updateTopologies method) and do the rest of the exchange work.
4) Once the exchange is done, we merge the pending futures into the current one and remove them from the queue. After that step all topologies have valid node2part maps, but they still hold the outdated discovery caches belonging to the earliest (current) exchange future.

Possible fix: after the pending exchanges are successfully merged into the current one, invoke the 'updateTopologies' method again to refresh the discovery caches with the new topology version. A rough sketch of this idea follows.
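Below is a minimal, self-contained Java model of the proposed change, not Ignite's real exchange code: all names in it (ExchangeLoopSketch, ExchangeFuture, updateTopologies, discoCache strings) are illustrative stand-ins. It only demonstrates the ordering the fix calls for: process the earliest future, merge the rest, then refresh the topologies from the discovery cache of the latest merged version.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the proposed fix; names are illustrative, not Ignite's API. */
class ExchangeLoopSketch {
    /** Stand-in for a NODE_JOIN exchange future and the discovery cache it captured. */
    record ExchangeFuture(long topVer, String discoCache) {}

    private final Deque<ExchangeFuture> queue = new ArrayDeque<>();

    /** State that updateTopologies() would normally push into each cache group topology. */
    private String topologyDiscoCache;

    /** Processes the earliest future, merges the rest, then applies the fix. */
    void process() {
        ExchangeFuture cur = queue.poll();      // earliest future, earliest disco cache
        updateTopologies(cur.discoCache());     // step 3: topologies see the earliest cache

        ExchangeFuture last = cur;
        while (!queue.isEmpty())                // step 4: merge pending futures into current
            last = queue.poll();

        // Proposed fix: after a successful merge, refresh the topologies with the
        // discovery cache of the latest merged version instead of keeping the stale one.
        if (last != cur)
            updateTopologies(last.discoCache());
    }

    private void updateTopologies(String discoCache) {
        topologyDiscoCache = discoCache;
    }

    public static void main(String[] args) {
        ExchangeLoopSketch loop = new ExchangeLoopSketch();
        loop.queue.add(new ExchangeFuture(9, "discoCache@topVer=9"));
        loop.queue.add(new ExchangeFuture(10, "discoCache@topVer=10"));
        loop.queue.add(new ExchangeFuture(11, "discoCache@topVer=11"));
        loop.process();
        // Prints "discoCache@topVer=11"; without the final updateTopologies()
        // call it would still hold the outdated "discoCache@topVer=9".
        System.out.println(loop.topologyDiscoCache);
    }
}
{code}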
> testAssignmentAfterRestarts is flaky on TC
> ------------------------------------------
>
>                 Key: IGNITE-7717
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7717
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>
> There is an infinite wait for the partition map exchange to complete:
> {noformat}
> [2018-02-15 13:41:46,180][WARN ][test-runner-#1%persistence.IgnitePdsCacheAssignmentNodeRestartsTest%][root]
> Waiting for topology map update [igniteInstanceName=persistence.IgnitePdsCacheAssignmentNodeRestartsTest0,
> cache=ignite-sys-cache, cacheId=-2100569601, topVer=AffinityTopologyVersion [topVer=11, minorTopVer=0],
> p=0, affNodesCnt=5, ownersCnt=3,
> affNodes=[126cbc54-1b9f-46b8-a978-b6c61ee00001, 0971749e-e313-4c57-99ab-40400b100000,
> 84f71ca6-6213-43a0-91ea-42eca5100002, 3d781b31-ed38-49c8-8875-bdfa2fa00003, 8f4bdf1c-a2c8-45e8-acd7-64bb45600004],
> owners=[0971749e-e313-4c57-99ab-40400b100000, 126cbc54-1b9f-46b8-a978-b6c61ee00001, 3d781b31-ed38-49c8-8875-bdfa2fa00003],
> topFut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
> [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47502], discPort=47502,
> order=9, intOrder=6, lastExchangeTime=1518691298151, loc=false, ver=2.5.0#19700101-sha1:00000000, isClient=false],
> topVer=9, nodeId8=0971749e, msg=Node joined: TcpDiscoveryNode [id=3d781b31-ed38-49c8-8875-bdfa2fa00003,
> addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47502], discPort=47502, order=9, intOrder=6, lastExchangeTime=1518691298151,
> loc=false, ver=2.5.0#19700101-sha1:00000000, isClient=false], type=NODE_JOINED, tstamp=1518691298244],
> crd=TcpDiscoveryNode [id=0971749e-e313-4c57-99ab-40400b100000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500],
> discPort=47500, order=1, intOrder=1, lastExchangeTime=1518691306165, loc=true, ver=2.5.0#19700101-sha1:00000000, isClient=false],
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0],
> discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:47502], discPort=47502, order=9, intOrder=6, lastExchangeTime=1518691298151, loc=false,
> ver=2.5.0#19700101-sha1:00000000, isClient=false], topVer=9, nodeId8=0971749e, msg=Node joined: TcpDiscoveryNode
> [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47502], discPort=47502,
> order=9, intOrder=6, lastExchangeTime=1518691298151, loc=false, ver=2.5.0#19700101-sha1:00000000, isClient=false],
> type=NODE_JOINED, tstamp=1518691298244], nodeId=3d781b31, evt=NODE_JOINED], added=true,
> initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true, hash=2121252210], init=true,
> lastVer=GridCacheVersion [topVer=0, order=1518691297806, nodeOrder=0],
> partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0],
> futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], futures=[]],
> TxReleaseFuture [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], futures=[]],
> AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], futures=[]],
> DataStreamerReleaseFuture [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], futures=[]]]],
> exchActions=null, affChangeMsg=null, initTs=1518691298244, centralizedAff=false, forceAffReassignment=false,
> changeGlobalStateE=null, done=true, state=DONE, evtLatch=0, remaining=[],
> super=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=AffinityTopologyVersion [topVer=11, minorTopVer=0], hash=1135515588]],
> locNode=TcpDiscoveryNode [id=0971749e-e313-4c57-99ab-40400b100000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500],
> discPort=47500, order=1, intOrder=1, lastExchangeTime=1518691306165, loc=true, ver=2.5.0#19700101-sha1:00000000, isClient=false]]
> {noformat}
> This happens because the discoCache (its cacheGrpAffNodes map) becomes inconsistent across nodes after the restart.
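For reference, the "Waiting for topology map update" message above is the visible symptom: the test waits until every affinity node owns the partition, and with affNodesCnt=5 but ownersCnt=3 that never happens while nodes disagree on cacheGrpAffNodes. A hypothetical sketch of the shape of that check, with illustrative names only (not Ignite's actual test helper):

{code:java}
import java.util.Set;
import java.util.UUID;

/** Illustrative stand-in for the test's wait condition, not Ignite's real code. */
class TopologyWaitSketch {
    /**
     * Returns true once every affinity node is also a partition owner.
     * With the inconsistent cacheGrpAffNodes map the log keeps showing
     * affNodesCnt=5 vs ownersCnt=3, so a condition of this shape stays
     * false and the caller logs "Waiting for topology map update" forever.
     */
    static boolean topologyMapUpdated(Set<UUID> affNodes, Set<UUID> owners) {
        return owners.containsAll(affNodes);
    }
}
{code}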