[ 
https://issues.apache.org/jira/browse/IGNITE-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita updated IGNITE-16589:
-------------------------------------
    Release Note: Fixed an issue that led to failures of server nodes caused 
by a too-short history of affinity assignments. 

> Failure handler kills server node on getting affinity from old topology
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-16589
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16589
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>             Fix For: 2.13
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In general, the following exception seems to be a bit overkill
> {code:java}
> [2022-02-21 
> 10:34:53,347][ERROR][aff-#300%cache.CacheNoAffinityExchangeTest0%][IgniteTestResources]
>  Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, 
> err=java.lang.IllegalStateException: Getting affinity for too old topology 
> version that is already out of history [locNode=TcpDiscoveryNode 
> [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, 
> addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], 
> discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, 
> loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], 
> grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], 
> lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], 
> head=AffinityTopologyVersion [topVer=8, minorTopVer=0], 
> history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], 
> AffinityTopologyVersion [topVer=8, minorTopVer=0]]]]]
> java.lang.IllegalStateException: Getting affinity for too old topology 
> version that is already out of history [locNode=TcpDiscoveryNode 
> [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, 
> addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], 
> discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, 
> loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], 
> grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], 
> lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], 
> head=AffinityTopologyVersion [topVer=8, minorTopVer=0], 
> history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], 
> AffinityTopologyVersion [topVer=8, minorTopVer=0]]]
>       at 
> org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:849)
>       at 
> org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:796)
>       at 
> org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest0(CacheGroupContext.java:1130)
>       at 
> org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest(CacheGroupContext.java:1116)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421)
>       at 
> org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> It seems that a much more suitable way to handle this would be to 
> kill/restart the client node, return an error for the client to handle, or 
> attempt to remap the operation on the new topology. For now, it is not clear 
> why exactly the node must be shut down after this exception, and the 
> documentation does not explain how to avoid it.
> The root cause of the issue is that the AffinityRequest from the client node 
> refers to an "old" topology version that has already been wiped on the 
> server side.
> The possible scenario is the following:
>  - the client wants to get a proxy for a cache that is already started on 
> server nodes
>  - the client starts a custom exchange task (see 
> GridCacheProcessor.processCustomExchangeTask)
>  - before sending the AffinityRequest, the client node hangs, for example 
> due to a long GC pause
>  - the cluster topology changes multiple times during that pause (enough 
> times to evict the old history of affinity assignments)
>  - the server node receives the AffinityRequest from the client but cannot 
> process it correctly because the required history is gone.
> IMHO we can respond to the client with an "empty" AffinityResponse that 
> conveys the cause of the problem. In that case, the client node can try to 
> reconnect to the cluster (if PME is in progress) or retry the operation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
