[
https://issues.apache.org/jira/browse/IGNITE-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vyacheslav Koptilin updated IGNITE-11253:
-----------------------------------------
Description:
* If eager TTL is configured, a starting node creates and starts the
{{cleanupWorker}} (see {{GridCacheTtlManager.start0()}}).
* {{GridCacheSharedTtlCleanupManager.CleanupWorker}}, in turn, has to wait
for the {{discovery().localJoin()}} future, which is completed by the discovery thread.
* Meanwhile, the exchange thread stops cache contexts and, therefore,
stops the {{cleanupWorker}} as well:
{code:java}
org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.stopCleanupWorker(GridCacheSharedTtlCleanupManager.java:109)
org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.unregister(GridCacheSharedTtlCleanupManager.java:82)
org.apache.ignite.internal.processors.cache.GridCacheTtlManager.onKernalStop0(GridCacheTtlManager.java:110)
org.apache.ignite.internal.processors.cache.GridCacheManagerAdapter.onKernalStop(GridCacheManagerAdapter.java:111)
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:1495)
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStopCaches(GridCacheProcessor.java:1182)
org.apache.ignite.internal.processors.cache.GridCacheProcessor$CacheRecoveryLifecycle.onBaselineChange(GridCacheProcessor.java:5637)
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:910)
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:792)
{code}
So the exchange thread may try to stop the {{cleanupWorker}} before the
{{localJoin}} future is completed by the discovery thread. The
{{cleanupWorker}} handles this interruption incorrectly, which can lead to a
node failure:
{code:java}
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Got
interrupted while waiting for future to complete.]]
class org.apache.ignite.IgniteException: Got interrupted while waiting for future to complete.
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2217)
    at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:136)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: Got interrupted while waiting for future to complete.
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:186)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2214)
    ... 3 more
{code}
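The race described above can be reproduced in miniature with plain JDK classes. The following is only a sketch under stated assumptions, not Ignite code: {{localJoinFut}}, the thread name and the wrapping {{RuntimeException}} are hypothetical stand-ins for the {{localJoin}} future, the {{cleanupWorker}} thread and the {{IgniteException}} seen in the trace.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptedJoinDemo {
    /**
     * Starts a worker that blocks on a future nobody completes, interrupts it
     * (as a stopping thread would), and returns whatever the worker threw.
     */
    static Throwable runAndInterrupt() {
        // Hypothetical stand-in for the localJoin future; never completed here.
        CompletableFuture<Void> localJoinFut = new CompletableFuture<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            try {
                try {
                    localJoinFut.get();
                }
                catch (InterruptedException | ExecutionException e) {
                    // The interruption is wrapped, as in the stack trace above.
                    throw new RuntimeException("Got interrupted while waiting for future to complete.", e);
                }
            }
            catch (Throwable t) {
                failure.set(t);
            }
        }, "cleanup-worker");

        worker.start();
        worker.interrupt(); // Plays the role of the exchange thread stopping the worker.

        try {
            worker.join(10_000);
        }
        catch (InterruptedException e) {
            throw new IllegalStateException("Demo itself was interrupted", e);
        }

        return failure.get();
    }

    public static void main(String[] args) {
        Throwable t = runAndInterrupt();

        System.out.println(t.getClass().getName() + " caused by " + t.getCause().getClass().getName());
    }
}
```

In this sketch the worker never sees a bare interruption exception, only a wrapper whose cause is the interruption, which mirrors the nesting shown in the trace.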
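As the trace shows, the {{IgniteInterruptedCheckedException}} reaches the worker
wrapped in an {{IgniteException}}, so an {{instanceof}} check does not recognize it.
The obvious fix is to change the catch block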
{code:java}
catch (Throwable t) {
    if (!(t instanceof IgniteInterruptedCheckedException))
        err = t;

    throw t;
}
{code}
to the following:
{code:java}
catch (Throwable t) {
    // Check the whole cause chain: the interruption may be wrapped in another exception.
    if (!X.hasCause(t, IgniteInterruptedCheckedException.class))
        err = t;

    throw t;
}
{code}
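To illustrate the difference between the two checks, here is a minimal, self-contained sketch. It does not use Ignite classes: {{InterruptedCheckedException}} and {{hasCause}} are hypothetical stand-ins for {{IgniteInterruptedCheckedException}} and for a cause-chain check of the kind the fix relies on.

```java
public class CauseChainDemo {
    /** Hypothetical stand-in for IgniteInterruptedCheckedException. */
    static class InterruptedCheckedException extends Exception {
        InterruptedCheckedException(String msg) {
            super(msg);
        }
    }

    /** Walks the cause chain of t and reports whether any link is an instance of cls. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> cls) {
        for (Throwable th = t; th != null; th = th.getCause()) {
            if (cls.isInstance(th))
                return true;
        }

        return false;
    }

    public static void main(String[] args) {
        // Same nesting as in the trace: an unchecked wrapper around the interruption.
        Throwable wrapped = new RuntimeException(
            "Got interrupted while waiting for future to complete.",
            new InterruptedCheckedException("Got interrupted while waiting for future to complete."));

        // The top-level type check misses the wrapped interruption...
        System.out.println("instanceof: " + (wrapped instanceof InterruptedCheckedException));

        // ...while the cause-chain check finds it.
        System.out.println("hasCause:   " + hasCause(wrapped, InterruptedCheckedException.class));
    }
}
```

With a check of this shape, a wrapped interruption is no longer recorded in {{err}}, so stopping the worker during startup would not be reported as a critical failure.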
> When a node that is not part of the base topology joins the cluster, it may
> lead to a node failure.
> ---------------------------------------------------------------------------------------------------
>
> Key: IGNITE-11253
> URL: https://issues.apache.org/jira/browse/IGNITE-11253
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.7
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Fix For: 2.8
>
>