Andrey Aleksandrov created IGNITE-8098:
------------------------------------------

             Summary: Getting affinity for topology version earlier than 
affinity is calculated because of data race
                 Key: IGNITE-8098
                 URL: https://issues.apache.org/jira/browse/IGNITE-8098
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 2.3
            Reporter: Andrey Aleksandrov


From time to time an Ignite cluster with deployed services throws the following 
exception during the restart of some nodes:

java.lang.IllegalStateException: Getting affinity for topology version earlier 
than affinity is calculated [locNode=TcpDiscoveryNode 
[id=c770dbcf-2908-442d-8aa0-bf26a2aecfef, addrs=[10.44.162.169, 127.0.0.1], 
sockAddrs=[clrv0000041279.ic.ing.net/10.44.162.169:56500, /127.0.0.1:56500], 
discPort=56500, order=11, intOrder=8, lastExchangeTime=1520931375337, loc=true, 
ver=2.3.3#20180213-sha1:f446df34, isClient=false], grp=ignite-sys-cache, 
topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], 
head=AffinityTopologyVersion [topVer=15, minorTopVer=0], 
history=[AffinityTopologyVersion [topVer=11, minorTopVer=0], 
AffinityTopologyVersion [topVer=11, minorTopVer=1], AffinityTopologyVersion 
[topVer=12, minorTopVer=0], AffinityTopologyVersion [topVer=15, minorTopVer=0]]]

It looks like the cause of this issue is a data race in the GridServiceProcessor 
class: between the affinity-ready wait and the reassign() call, further exchanges 
can complete and the awaited topology version can be evicted from the affinity 
history (note that topVer=13 is missing from the history in the exception above), 
so reassign() requests affinity for a version that is no longer available.
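
For illustration, here is a minimal, self-contained sketch of the same pattern 
(all names are hypothetical; this is not Ignite internals): one thread learns 
that the value for a version is ready, then reads it only after a delay, by 
which time newer versions have pushed the awaited one out of a bounded history:

import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical model of the race: a bounded per-version "affinity" history
// written by an exchange-like thread and read by a deployment-like thread.
public class AffinityHistoryRaceSketch {
    static final int HISTORY_SIZE = 4;

    static final ConcurrentSkipListMap<Long, String> history =
        new ConcurrentSkipListMap<>();

    public static void main(String[] args) throws Exception {
        long topVer = 13;

        history.put(topVer, "affinity@" + topVer); // affinity for 13 is ready

        Thread deployer = new Thread(() -> {
            // The "affinity ready" wait has already completed here; the
            // delay models the Thread.sleep(100) from the reproducer below.
            try {
                Thread.sleep(100);
            }
            catch (InterruptedException ignored) {
                // No-op: the sketch ignores interruption.
            }

            // By now newer versions have evicted topVer from the bounded
            // history, so the read fails the same way reassign() does.
            if (!history.containsKey(topVer))
                throw new IllegalStateException("Getting affinity for topology " +
                    "version earlier than affinity is calculated: " + topVer);
        });

        deployer.start();

        // Meanwhile new topology versions arrive and the history rolls over.
        for (long v = 14; v <= 20; v++) {
            history.put(v, "affinity@" + v);

            while (history.size() > HISTORY_SIZE)
                history.pollFirstEntry(); // evict the oldest version
        }

        deployer.join(); // the deployer thread dies with the exception above
    }
}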


How to reproduce:

1) To simulate the data race, patch the following place in the source code:



Class: GridServiceProcessor
Method: @Override public void onEvent(final DiscoveryEvent evt, final 
DiscoCache discoCache) {
Place:

....

try {
    svcName.set(dep.configuration().getName());

    ctx.cache().internalCache(UTILITY_CACHE_NAME).context().affinity().
        affinityReadyFuture(topVer).get();

    // HERE (between the affinity-ready GET above and REASSIGN below) add a
    // delay, e.g. Thread.sleep(100), to simulate the data race:
    try {
        Thread.sleep(100);
    }
    catch (InterruptedException e1) {
        e1.printStackTrace();
    }

    reassign(dep, topVer);
}
catch (IgniteCheckedException ex) {
    if (!(ex instanceof ClusterTopologyCheckedException))
        LT.error(log, ex, "Failed to do service reassignment (will retry): " +
            dep.configuration().getName());

    retries.add(dep);
}

...

2) After that, imitate node start/shutdown iterations. For reproducing I used 
GridServiceProcessorBatchDeploySelfTest (but the timeout on future.get() should 
be increased to avoid a timeout error); a standalone sketch of such a restart 
loop is shown below.
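
For a standalone illustration of such start/shutdown iterations, a loop like 
the following can be used (a rough sketch only: discovery configuration is left 
at defaults, and the service batch deployment done by the actual self-test is 
omitted):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Minimal restart loop: one node stays up while a second node is started
// and stopped repeatedly, so the topology version advances each iteration.
// Instance names, the iteration count and the sleep are arbitrary.
public class RestartLoop {
    public static void main(String[] args) throws Exception {
        Ignite stable = Ignition.start(
            new IgniteConfiguration().setIgniteInstanceName("stable"));

        for (int i = 0; i < 10; i++) {
            Ignite restarted = Ignition.start(
                new IgniteConfiguration().setIgniteInstanceName("restarted"));

            Thread.sleep(500); // let exchange and service reassignment run

            Ignition.stop(restarted.name(), true);
        }

        Ignition.stop(stable.name(), true);
    }
}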



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
