[jira] [Commented] (IGNITE-10226) Partition may restore wrong MOVING state during crash recovery

Matija Polajnar (Jira) Wed, 16 Oct 2019 04:26:20 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952729#comment-16952729
 ]


Matija Polajnar commented on IGNITE-10226:
------------------------------------------

In our development environments (luckily only there, so far) we sometimes get 
errors like this one:
{code:java}
    ...
Caused by: javax.cache.CacheException: class org.apache.ignite.cluster.ClusterTopologyException: Cannot run update query. Node must own all the necessary partitions.
    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1337) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.IgniteCacheFutureImpl.convertException(IgniteCacheFutureImpl.java:62) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:137) ~[ignite-core-2.7.0.jar:2.7.0]
    at com.marand.thinkehr.tasks.common.ignite.IgniteCompletableFuture.lambda$new$2ae3f52e$1(IgniteCompletableFuture.java:25) ~[classes/:?]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl$InternalFutureListener.apply(IgniteFutureImpl.java:215) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl$InternalFutureListener.apply(IgniteFutureImpl.java:179) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl.listen(IgniteFutureImpl.java:71) ~[ignite-core-2.7.0.jar:2.7.0]
    ...
Caused by: org.apache.ignite.cluster.ClusterTopologyException: Cannot run update query. Node must own all the necessary partitions.
    at org.apache.ignite.internal.util.IgniteUtils$7.apply(IgniteUtils.java:888) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.IgniteUtils$7.apply(IgniteUtils.java:886) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1337) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.IgniteCacheFutureImpl.convertException(IgniteCacheFutureImpl.java:62) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:137) ~[ignite-core-2.7.0.jar:2.7.0]
    ...
Caused by: org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Cannot run update query. Node must own all the necessary partitions.
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxAbstractEnlistFuture.checkPartitions(GridDhtTxAbstractEnlistFuture.java:922) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxAbstractEnlistFuture.init(GridDhtTxAbstractEnlistFuture.java:336) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.enlistLocal(GridNearTxEnlistFuture.java:518) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.sendBatch(GridNearTxEnlistFuture.java:413) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.sendNextBatches(GridNearTxEnlistFuture.java:168) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.map(GridNearTxEnlistFuture.java:144) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxAbstractEnlistFuture.init(GridNearTxAbstractEnlistFuture.java:241) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.updateAsync(GridNearTxLocal.java:2099) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.mvccRemoveAllAsync0(GridNearTxLocal.java:1976) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1689) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:554) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter$40.op(GridCacheAdapter.java:3174) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter$AsyncOp.op(GridCacheAdapter.java:5288) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.asyncOp(GridCacheAdapter.java:4450) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.asyncOp(GridCacheAdapter.java:4345) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAllAsync0(GridCacheAdapter.java:3172) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAllAsync(GridCacheAdapter.java:3159) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.removeAllAsync(IgniteCacheProxyImpl.java:1342) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.removeAllAsync(GatewayProtectedCacheProxy.java:1072) ~[ignite-core-2.7.0.jar:2.7.0]
    ...
{code}
Since we run Ignite embedded in a Java application, it probably gets shut 
down uncleanly a lot during development. This is typically a single-node 
setup. The backup count is set to 1, but there is only one node anyway (so 
I'm not sure why a partition would ever be MOVING).

I set a breakpoint in GridDhtTxAbstractEnlistFuture.checkPartitions and found 
the offending partitions had a status of MOVING.

I suspect this might also explain why IgniteCache.get(x) and 
IgniteCache.containsKey(x) sometimes return null and false, respectively, 
even though the cache certainly contains the key x with a non-null value 
(i.e. cache.containsKey(cache.iterator().next().getKey()) returns false).
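
A check like the one described above could be automated roughly as follows. This is only a sketch: the class CacheReadCheck and the method keysMissingOnLookup are hypothetical names, not part of Ignite, and the lookups are passed in as plain functions so that the snippet runs without an Ignite node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical helper, not part of Ignite. */
public class CacheReadCheck {
    /**
     * Returns the keys that were seen while iterating the cache but that a
     * direct lookup (containsKey or get) fails to find.
     */
    static <K, V> List<K> keysMissingOnLookup(Iterable<K> iteratedKeys,
                                              Predicate<K> containsKey,
                                              Function<K, V> get) {
        List<K> missing = new ArrayList<>();
        for (K key : iteratedKeys) {
            if (!containsKey.test(key) || get.apply(key) == null)
                missing.add(key);
        }
        return missing;
    }

    public static void main(String[] args) {
        // A plain map stands in for a healthy cache: every iterated key is found.
        Map<String, Integer> healthy = new HashMap<>();
        healthy.put("a", 1);
        healthy.put("b", 2);
        System.out.println(keysMissingOnLookup(healthy.keySet(), healthy::containsKey, healthy::get)); // []

        // Simulated symptom: iteration yields "b", but lookups do not find it.
        Map<String, Integer> lookups = new HashMap<>();
        lookups.put("a", 1);
        System.out.println(keysMissingOnLookup(Arrays.asList("a", "b"), lookups::containsKey, lookups::get)); // [b]
    }
}
```

Against a real IgniteCache, the keys would presumably be collected by iterating the cache and calling getKey() on each entry, with cache::containsKey and cache::get passed as the two lookups.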

resetLostPartitions probably has no effect in this case?

> Partition may restore wrong MOVING state during crash recovery
> --------------------------------------------------------------
>
>                 Key: IGNITE-10226
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10226
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Major
>             Fix For: 2.8
>
>
> This can be reproduced only in versions that don't have IGNITE-9420:
> 1) Start a cache, upload some data to partitions, forceCheckpoint.
> 2) Start uploading additional data. Kill the node. The node should be killed 
> so that the last checkpoint is skipped, or during the checkpoint mark phase.
> 3) Restart the node. The crash recovery process for partitions starts. When 
> we create a partition during crash recovery 
> (topology().forceCreatePartition()) we log its initial state to the WAL. If 
> there is any logical update related to the partition, we log a wrong MOVING 
> state to the end of the current WAL. This state is treated as the last valid 
> one when we process PartitionMetaStateRecord records during logical 
> recovery. In the "restorePartitionsState" phase this state is chosen as 
> final and the partition changes to MOVING, even if in page memory it is 
> OWNING or something else.
> To fix this problem in versions 2.4 - 2.7, the additional logging of 
> partition state changes to the WAL during crash recovery (logical recovery) 
> should be removed.
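
The sequence in the quoted description can be illustrated with a small toy model in plain Java. It is not Ignite code: WalRecoveryModel, StateRecord, recover and restoreStates are hypothetical names, and the WAL is reduced to a list of partition state records. The only behaviour taken from the description is that the last state record for a partition wins, and that the buggy path appends the initial MOVING state during recovery.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the scenario described above; all names are hypothetical. */
public class WalRecoveryModel {
    enum State { MOVING, OWNING }

    /** Stand-in for a partition state record written to the WAL. */
    static final class StateRecord {
        final int partId;
        final State state;

        StateRecord(int partId, State state) {
            this.partId = partId;
            this.state = state;
        }
    }

    /** Replays the WAL: for each partition, the last state record wins. */
    static Map<Integer, State> restoreStates(List<StateRecord> wal) {
        Map<Integer, State> res = new HashMap<>();
        for (StateRecord rec : wal)
            res.put(rec.partId, rec.state);
        return res;
    }

    /**
     * Models re-creating a partition during crash recovery. With
     * logInitialState == true (the buggy behaviour), the initial MOVING
     * state is appended to the end of the WAL.
     */
    static List<StateRecord> recover(List<StateRecord> walBeforeCrash, int partId, boolean logInitialState) {
        List<StateRecord> wal = new ArrayList<>(walBeforeCrash);
        if (logInitialState)
            wal.add(new StateRecord(partId, State.MOVING));
        return wal;
    }

    public static void main(String[] args) {
        List<StateRecord> wal = new ArrayList<>();
        wal.add(new StateRecord(0, State.OWNING)); // state logged before the crash

        System.out.println(restoreStates(recover(wal, 0, true)));  // {0=MOVING}
        System.out.println(restoreStates(recover(wal, 0, false))); // {0=OWNING}
    }
}
```

In this model, skipping the extra state record during recovery (the second call) leaves the partition in the state that was logged before the crash, which corresponds to the fix proposed in the description.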



