[ https://issues.apache.org/jira/browse/IGNITE-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367118#comment-16367118 ]

Pavel Kovalenko commented on IGNITE-7717:
-----------------------------------------

Cause of the problem:

Suppose we have an existing topology and start several nodes in parallel.
1) There are several pending exchange futures (NODE_JOIN) in the queue. Each 
exchange future has its own discovery cache, and each discovery cache contains 
a different state of the alive nodes, etc.
2) We pick and process the first (earliest) future, which carries the earliest 
state of the discovery cache.
3) We update the topology discovery caches (the updateTopologies method) and do 
the rest of the exchange work.
4) After the exchange is done, we merge the pending futures into the current 
one and remove them from the queue.
After that step we have valid node2part maps on all topologies, but the 
discovery caches still belong to the earliest (current) exchange future and are 
therefore outdated.
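
To make the problem easier to see, here is a minimal, self-contained sketch of 
the flow above. DiscoCacheSnapshot, ExchangeFuture and the queue handling are 
hypothetical stand-ins for illustration only, not the actual Ignite internals:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified stand-ins for the real Ignite classes.
class DiscoCacheSnapshot {
    final List<String> aliveNodes;          // alive nodes as seen when the discovery event fired
    DiscoCacheSnapshot(List<String> aliveNodes) { this.aliveNodes = aliveNodes; }
}

class ExchangeFuture {
    final long topVer;
    final DiscoCacheSnapshot discoCache;    // each NODE_JOIN future carries its own snapshot
    ExchangeFuture(long topVer, DiscoCacheSnapshot discoCache) {
        this.topVer = topVer;
        this.discoCache = discoCache;
    }
}

public class MergeProblemSketch {
    public static void main(String[] args) {
        // 1) Several pending NODE_JOIN exchange futures, each with its own discovery cache.
        Deque<ExchangeFuture> queue = new ArrayDeque<>();
        queue.add(new ExchangeFuture(9,  new DiscoCacheSnapshot(List.of("A", "B", "C"))));
        queue.add(new ExchangeFuture(10, new DiscoCacheSnapshot(List.of("A", "B", "C", "D"))));
        queue.add(new ExchangeFuture(11, new DiscoCacheSnapshot(List.of("A", "B", "C", "D", "E"))));

        // 2) Pick and process the earliest future: it carries the oldest discovery cache.
        ExchangeFuture current = queue.poll();

        // 3) updateTopologies(...) and the rest of the exchange run against current.discoCache.

        // 4) Merge the remaining pending futures into the current one and drop them from the queue.
        long mergedTopVer = current.topVer;
        while (!queue.isEmpty())
            mergedTopVer = queue.poll().topVer;

        // node2part maps now reflect topVer=11, but the discovery cache still reflects topVer=9.
        System.out.println("merged topVer = " + mergedTopVer
            + ", alive nodes in disco cache = " + current.discoCache.aliveNodes);
    }
}
{code}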

Possible fix:
After successfully merging the pending exchanges into the current one, invoke 
the 'updateTopologies' method again and update the discovery caches with the 
new topology version.
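
Continuing the toy model above (and reusing its DiscoCacheSnapshot and 
ExchangeFuture classes), the fix could look roughly as follows; updateTopologies 
here is only a stand-in for GridDhtPartitionsExchangeFuture#updateTopologies, 
not the real method:

{code:java}
import java.util.Deque;

// Reuses DiscoCacheSnapshot and ExchangeFuture from the sketch above.
public class MergeFixSketch {
    /**
     * Illustration of the proposed fix: once the pending exchanges are merged
     * into the current one, rebuild the current future around the discovery
     * cache of the last merged event and run updateTopologies() once more.
     */
    static ExchangeFuture mergeAndRefresh(ExchangeFuture current, Deque<ExchangeFuture> pending) {
        ExchangeFuture last = current;

        while (!pending.isEmpty())
            last = pending.poll();              // merge pending futures into the current one

        // Key step: the merged exchange now carries the discovery cache that matches
        // the merged topology version instead of the earliest snapshot.
        ExchangeFuture merged = new ExchangeFuture(last.topVer, last.discoCache);

        updateTopologies(merged);               // re-run the topology update with the fresh disco cache
        return merged;
    }

    // Toy stand-in for the real updateTopologies call.
    static void updateTopologies(ExchangeFuture fut) {
        System.out.println("topologies updated for topVer=" + fut.topVer
            + " with alive nodes " + fut.discoCache.aliveNodes);
    }
}
{code}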


> testAssignmentAfterRestarts is flaky on TC
> ------------------------------------------
>
>                 Key: IGNITE-7717
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7717
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>
> The partition map exchange waits indefinitely:
> {noformat}
> [2018-02-15 13:41:46,180][WARN 
> ][test-runner-#1%persistence.IgnitePdsCacheAssignmentNodeRestartsTest%][root] 
> Waiting for topology map update 
> [igniteInstanceName=persistence.IgnitePdsCacheAssignmentNodeRestartsTest0, 
> cache=ignite-sys-cache, cacheId=-2100569601, topVer=AffinityTopologyVersion 
> [topVer=11, minorTopVer=0], p=0, affNodesCnt=5, ownersCnt=3, 
> affNodes=[126cbc54-1b9f-46b8-a978-b6c61ee00001, 
> 0971749e-e313-4c57-99ab-40400b100000, 84f71ca6-6213-43a0-91ea-42eca5100002, 
> 3d781b31-ed38-49c8-8875-bdfa2fa00003, 8f4bdf1c-a2c8-45e8-acd7-64bb45600004], 
> owners=[0971749e-e313-4c57-99ab-40400b100000, 
> 126cbc54-1b9f-46b8-a978-b6c61ee00001, 3d781b31-ed38-49c8-8875-bdfa2fa00003], 
> topFut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, 
> addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47502], discPort=47502, order=9, 
> intOrder=6, lastExchangeTime=1518691298151, loc=false, 
> ver=2.5.0#19700101-sha1:00000000, isClient=false], topVer=9, 
> nodeId8=0971749e, msg=Node joined: TcpDiscoveryNode 
> [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47502], discPort=47502, order=9, intOrder=6, 
> lastExchangeTime=1518691298151, loc=false, ver=2.5.0#19700101-sha1:00000000, 
> isClient=false], type=NODE_JOINED, tstamp=1518691298244], 
> crd=TcpDiscoveryNode [id=0971749e-e313-4c57-99ab-40400b100000, 
> addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, 
> intOrder=1, lastExchangeTime=1518691306165, loc=true, 
> ver=2.5.0#19700101-sha1:00000000, isClient=false], 
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=9, 
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
> [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47502], discPort=47502, order=9, intOrder=6, 
> lastExchangeTime=1518691298151, loc=false, ver=2.5.0#19700101-sha1:00000000, 
> isClient=false], topVer=9, nodeId8=0971749e, msg=Node joined: 
> TcpDiscoveryNode [id=3d781b31-ed38-49c8-8875-bdfa2fa00003, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47502], discPort=47502, order=9, intOrder=6, 
> lastExchangeTime=1518691298151, loc=false, ver=2.5.0#19700101-sha1:00000000, 
> isClient=false], type=NODE_JOINED, tstamp=1518691298244], nodeId=3d781b31, 
> evt=NODE_JOINED], added=true, initFut=GridFutureAdapter 
> [ignoreInterrupts=false, state=DONE, res=true, hash=2121252210], init=true, 
> lastVer=GridCacheVersion [topVer=0, order=1518691297806, nodeOrder=0], 
> partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion 
> [topVer=9, minorTopVer=0], futures=[ExplicitLockReleaseFuture 
> [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], futures=[]], 
> TxReleaseFuture [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], 
> futures=[]], AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion 
> [topVer=9, minorTopVer=0], futures=[]], DataStreamerReleaseFuture 
> [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], futures=[]]]], 
> exchActions=null, affChangeMsg=null, initTs=1518691298244, 
> centralizedAff=false, forceAffReassignment=false, changeGlobalStateE=null, 
> done=true, state=DONE, evtLatch=0, remaining=[], super=GridFutureAdapter 
> [ignoreInterrupts=false, state=DONE, res=AffinityTopologyVersion [topVer=11, 
> minorTopVer=0], hash=1135515588]], locNode=TcpDiscoveryNode 
> [id=0971749e-e313-4c57-99ab-40400b100000, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, 
> lastExchangeTime=1518691306165, loc=true, ver=2.5.0#19700101-sha1:00000000, 
> isClient=false]]
> {noformat}
> This happens because of an inconsistency of the discoCache (cacheGrpAffNodes 
> map) on different nodes after the restart.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
