[ https://issues.apache.org/jira/browse/IGNITE-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amelchev Nikita updated IGNITE-16589: ------------------------------------- Release Note: Fixed an issue that led to failures of server nodes due to short history of affinity assignments. > Failure handler kills server node on getting affinity from old topology > ----------------------------------------------------------------------- > > Key: IGNITE-16589 > URL: https://issues.apache.org/jira/browse/IGNITE-16589 > Project: Ignite > Issue Type: Bug > Reporter: Vyacheslav Koptilin > Assignee: Vyacheslav Koptilin > Priority: Major > Fix For: 2.13 > > Time Spent: 20m > Remaining Estimate: 0h > > In general, the following exception seems to be a bit overkill > {code:java} > [2022-02-21 > 10:34:53,347][ERROR][aff-#300%cache.CacheNoAffinityExchangeTest0%][IgniteTestResources] > Critical system error detected. Will be handled accordingly to configured > handler [hnd=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet > [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]], > failureCtx=FailureContext [type=CRITICAL_ERROR, > err=java.lang.IllegalStateException: Getting affinity for too old topology > version that is already out of history [locNode=TcpDiscoveryNode > [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, > addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], > discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, > loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], > grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], > lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], > head=AffinityTopologyVersion [topVer=8, minorTopVer=0], > history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], > AffinityTopologyVersion [topVer=8, minorTopVer=0]]]]] > java.lang.IllegalStateException: Getting affinity for too old topology > version that is already out of history [locNode=TcpDiscoveryNode > [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, > addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], > discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, > loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], > grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], > lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], > head=AffinityTopologyVersion [topVer=8, minorTopVer=0], > history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], > AffinityTopologyVersion [topVer=8, minorTopVer=0]]] > at > org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:849) > at > org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:796) > at > org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest0(CacheGroupContext.java:1130) > at > org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest(CacheGroupContext.java:1116) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309) > at > org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) > at > org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242) > at > org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) > at > org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > it seems that much more suitable way to handle this should be to kill/restart > the client node or return an error for the client to handle or to attempt > remapping the operation on the new topology. For now, it is not clear why > exactly the node must be shut down after this exception and documentation > doesn't explain how to avoid this. > The root cause of the issue is that AffinityRequest from the client node > refers to the "old" topology version, which has been already wiped on the > server-side. > The possible scenario is the following: > - the client wants to get a proxy for a cache that is already started on > server nodes > - client started custom exchange task (see > GridCacheProcessor.processCustomExchangeTask) > - before sending AffinityRequest the client node hanged due to a long GC > pause, for example > - cluster topology changed multiple times during that GC pause (the number > of changes is enough to clean up the old history of affinity assignments) > - server node received AffinityRequest from the client and, unfortunately, > cannot process it in the right way because of a lack of history. > IMHO we can respond to the client with an "empty" AffinityResponse which > should provide a cause of the problem. > In that case, the client node may try to reconnect to the cluster (if PME is > in progress), or retry an operation. -- This message was sent by Atlassian Jira (v8.20.1#820001)