[ 
https://issues.apache.org/jira/browse/IGNITE-22191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-22191:
-------------------------------------
    Description: 
We need to fix the error shown in the following stack trace:
{noformat}
[2024-05-07T15:31:12,023][ERROR][%irt_trtcfz_0%metastorage-watch-executor-3][FailureProcessor] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [], failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.AssertionError: 8_part_1]]
java.lang.AssertionError: 8_part_1
    at org.apache.ignite.internal.index.IndexBuildController.mvPartitionStorage(IndexBuildController.java:345) ~[ignite-index-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.index.IndexBuildController.scheduleBuildIndexAfterDisasterRecovery(IndexBuildController.java:314) ~[ignite-index-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.index.IndexBuildController.lambda$tryScheduleBuildIndexesForNewPrimaryReplica$14(IndexBuildController.java:219) ~[ignite-index-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:869) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.index.IndexBuildController.tryScheduleBuildIndexesForNewPrimaryReplica(IndexBuildController.java:208) ~[ignite-index-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.index.IndexBuildController.lambda$onPrimaryReplicaElected$11(IndexBuildController.java:187) ~[ignite-index-3.0.0-SNAPSHOT.jar:?]
    at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108) ~[?:?]
    at org.apache.ignite.internal.index.IndexBuildController.lambda$onPrimaryReplicaElected$12(IndexBuildController.java:187) ~[ignite-index-3.0.0-SNAPSHOT.jar:?]
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
    at org.apache.ignite.internal.util.IgniteUtils.lambda$copyStateTo$9(IgniteUtils.java:1273) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:883) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2251) ~[?:?]
    at org.apache.ignite.internal.causality.BaseVersionedValue.copyState(BaseVersionedValue.java:315) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.causality.BaseVersionedValue.complete(BaseVersionedValue.java:201) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.causality.IncrementalVersionedValue.lambda$completeInternal$2(IncrementalVersionedValue.java:256) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:883) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2251) ~[?:?]
    at org.apache.ignite.internal.causality.IncrementalVersionedValue.completeInternal(IncrementalVersionedValue.java:256) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.causality.IncrementalVersionedValue.lambda$dependingOn$0(IncrementalVersionedValue.java:76) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.causality.BaseVersionedValue.lambda$notifyCompletionListeners$6(BaseVersionedValue.java:337) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:883) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2251) ~[?:?]
    at org.apache.ignite.internal.causality.BaseVersionedValue.notifyCompletionListeners(BaseVersionedValue.java:332) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
    at org.apache.ignite.internal.causality.BaseVersionedValue.complete(BaseVersionedValue.java:210) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
{noformat}

h2. What I managed to find out. 

The error can be reproduced in 
*org.apache.ignite.internal.rebalance.ItRebalanceTest#testRebalanceTablesCounterForZone*,
 although it may take several runs.

What happens in the test:
# 3 nodes start.
# 3 tables are created with 3 partitions and 3 replicas.
# We wait for rebalancing to finish for all tables.
# We change the number of replicas for the zone to 2.
# We wait for rebalancing to complete for all tables.

Let's say we have nodes A, B and C.
The problem occurs when the number of replicas changes from 3 to 2: node A has 
already left the assignments for a partition, but an event then arrives saying 
that node A has been elected as the primary replica (lease holder) for that 
partition. According to the rebalancing logic, the partition data is cleared 
after a node leaves the assignments, which is why the partition storage is 
missing on node A when the event is handled.

In this ticket, so that the tests pass, I will add a check: if the storage is 
*null*, we do not build indexes. The underlying problem will need to be fixed 
in ticket IGNITE-22202.
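
The null check described above could look roughly like the sketch below. This is an illustration only, not the actual patch: the class *IndexBuildGuard*, its storage map, and the method names are hypothetical and do not correspond to the real *IndexBuildController* API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the controller logic; none of these names come from Ignite. */
public class IndexBuildGuard {
    /** Hypothetical registry: partition key (for example "8_part_1") to its storage object. */
    private final Map<String, Object> storages = new ConcurrentHashMap<>();

    /** Registers a storage for the partition, as if the node were in the assignments. */
    public void registerStorage(String partitionKey, Object storage) {
        storages.put(partitionKey, storage);
    }

    /** Drops the storage, as rebalancing does once the node leaves the assignments. */
    public void destroyStorage(String partitionKey) {
        storages.remove(partitionKey);
    }

    /**
     * Runs the build task only if the partition storage still exists.
     *
     * @return true if the task was run, false if it was skipped because the storage is missing.
     */
    public boolean scheduleBuildIfStoragePresent(String partitionKey, Runnable buildTask) {
        Object storage = storages.get(partitionKey);

        if (storage == null) {
            // Instead of failing with an AssertionError, skip the index build.
            return false;
        }

        buildTask.run();

        return true;
    }
}
```

With this shape, a late primary replica election event for an already cleared partition becomes a no-op instead of a critical error. It only hides the symptom, which is why the root cause is deferred to IGNITE-22202.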

h2. My thoughts on how to fix the problem:
# Correct the rebalancing logic: if a node is the primary replica for a 
partition, do not remove it from the assignments.
# Correct the primary replica election logic: if a node is no longer in the 
partition assignments, do not elect it as the primary replica, or do not 
generate an event about it, or re-elect, or something else.
# Improve the logic in *IndexBuildController*: when a primary replica election 
event is received, ignore it if the node is not in the partition 
assignments.
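
As a rough illustration of the third option, the check could be a simple predicate over the election event and the current assignments. All names here (*PrimaryElectionFilter*, *PrimaryElectedEvent*, *shouldHandle*) are hypothetical; how the real controller would obtain the assignments is not specified in this ticket.

```java
import java.util.Set;

/** Hypothetical filter for primary replica election events; not part of Ignite. */
public class PrimaryElectionFilter {
    /** Hypothetical event shape: which node became lease holder for which partition. */
    public static final class PrimaryElectedEvent {
        final String partitionKey;
        final String leaseholderNode;

        public PrimaryElectedEvent(String partitionKey, String leaseholderNode) {
            this.partitionKey = partitionKey;
            this.leaseholderNode = leaseholderNode;
        }
    }

    /**
     * Returns true only if the local node is the elected lease holder
     * and is still present in the partition assignments.
     */
    public static boolean shouldHandle(
            PrimaryElectedEvent event,
            String localNode,
            Set<String> partitionAssignments
    ) {
        return localNode.equals(event.leaseholderNode)
                && partitionAssignments.contains(localNode);
    }
}
```

In the scenario from this ticket, node A would receive the event for a partition whose assignments are already {B, C}, so the predicate returns false and the event is ignored.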


> Fix AssertionError in IndexBuildController#mvPartitionStorage
> -------------------------------------------------------------
>
>                 Key: IGNITE-22191
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22191
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Kirill Tkalenko
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
