[jira] [Commented] (IGNITE-9562) Destroyed cache that resurrected on a old offline node breaks PME

Eduard Shangareev (JIRA) Tue, 06 Aug 2019 08:15:06 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901162#comment-16901162
 ]


Eduard Shangareev commented on IGNITE-9562:
-------------------------------------------

The proper solution is moving cache configuration to metastore. There was 
implemented a workaround to not get the cluster to hang.


We don’t allow nodes with caches which don’t exist in running cluster and the 
nodes have data on disk.

To allow the nodes join, you need to remove directories with absent caches or 
start the nodes first.

> Destroyed cache that resurrected on a old offline node breaks PME
> -----------------------------------------------------------------
>
>                 Key: IGNITE-9562
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9562
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5
>            Reporter: Pavel Kovalenko
>            Assignee: Eduard Shangareev
>            Priority: Critical
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Given:
> 2 nodes, persistence enabled.
> 1) Stop 1 node
> 2) Destroy cache through client
> 3) Start stopped node
> When the stopped node joins to cluster it starts all caches that it has seen 
> before stopping.
> If that cache was cluster-widely destroyed it leads to breaking the crash 
> recovery process or PME.
> Root cause - we don't start/collect caches from the stopped node on another 
> part of a cluster.
> In case of PARTITIONED cache mode that scenario breaks crash recovery:
> {noformat}
> java.lang.AssertionError: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
>       at 
> org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:696)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.updateLocal(GridDhtPartitionTopologyImpl.java:2449)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.afterStateRestored(GridDhtPartitionTopologyImpl.java:679)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2445)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2321)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1568)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1308)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1255)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:766)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2577)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2457)
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> In case of REPLICATED cache mode that scenario breaks PME coordinator process:
> {noformat}
> [2018-09-12 
> 18:50:36,407][ERROR][sys-#148%distributed.CacheStopAndRessurectOnOldNodeTest0%][GridCacheIoManager]
>  Failed to process message [senderId=4b6fd0d4-b756-4a9f-90ca-f0ee25100001, 
> messageType=class 
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsSingleMessage]
> java.lang.AssertionError: 3080586
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.clientTopology(GridCachePartitionExchangeManager.java:815)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updatePartitionSingleMap(GridDhtPartitionsExchangeFuture.java:3621)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2439)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:137)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2261)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2249)
>       at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
>       at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2249)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1628)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1100(GridCachePartitionExchangeManager.java:141)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:368)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:332)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2999)
>       at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2978)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> As one of the solutions - we shouldn't start such caches on resurrected nodes.
> We should save caches changes history somewhere and cluster-widely spread it 
> to joining nodes. 
> In a case when cache was only stopped, we can do nothing and start it lately 
> when cache start request received.
> In a case when cache was stopped & destroyed, we should clean persistence 
> data for that cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Commented] (IGNITE-9562) Destroyed cache that resurrected on a old offline node breaks PME

Reply via email to