[jira] [Commented] (IGNITE-26918) ItDisasterRecoveryControllerRestartPartitionsWithCleanupTest.testRestartTablePartitionsWithCleanupAllPartitions is flaky

Mirza Aliev (Jira) Mon, 10 Nov 2025 01:23:11 -0800


    [ https://issues.apache.org/jira/browse/IGNITE-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036763#comment-18036763 ]


Mirza Aliev commented on IGNITE-26918:
--------------------------------------

What was done:
* We fixed a bug in `TableManager#onZoneReplicaDestroyed`, in the following code:

```
CompletableFuture<?>[] futures = zoneTablesRawSet(zonePartitionId.zoneId()).stream()
        .map(table -> supplyAsync(
                () -> inBusyLockAsync(
                        busyLock,
                        () -> stopAndDestroyTablePartition(
                                new TablePartitionId(table.tableId(), zonePartitionId.partitionId()),
                                parameters.causalityToken()
                        )
                ),
                ioExecutor).thenCompose(identity()))
        .toArray(CompletableFuture[]::new);

return allOf(futures);
```


where `.thenCompose(identity())` was missing. Without it, `allOf(futures)`
completed as soon as each destroy operation had been started rather than
finished, so zone resources were treated as stopped while table storages were
actually still in the middle of being destroyed.
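
The effect of the missing `.thenCompose(identity())` can be reproduced outside Ignite. The following is a minimal sketch using only `java.util.concurrent`; the class name `NestedFutureDemo` and the `destroy` future are hypothetical stand-ins and not part of the Ignite code base. It compares `allOf` over a nested future with `allOf` over a flattened one while the inner work is still pending.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class NestedFutureDemo {
    /** Returns {allOf(nested) done?, allOf(flattened) done?}, sampled while the inner future is still pending. */
    static boolean[] doneBeforeInnerCompletes() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Hypothetical stand-in for an asynchronous partition destruction that is still running.
            CompletableFuture<Void> destroy = new CompletableFuture<>();

            // Not flattened: the outer future completes as soon as the supplier returns the inner one.
            CompletableFuture<CompletableFuture<Void>> nested =
                    CompletableFuture.supplyAsync(() -> destroy, executor);

            // Flattened: completes only when the inner future completes.
            CompletableFuture<Void> flattened =
                    CompletableFuture.supplyAsync(() -> destroy, executor)
                            .thenCompose(Function.identity());

            nested.join(); // Wait until the supplier has run; "destroy" is still pending.

            boolean nestedDone = CompletableFuture.allOf(nested).isDone();
            boolean flattenedDone = CompletableFuture.allOf(flattened).isDone();

            destroy.complete(null); // Let the simulated destruction finish.
            flattened.join();

            return new boolean[] {nestedDone, flattenedDone};
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] done = doneBeforeInnerCompletes();
        System.out.println("allOf(nested) done early:    " + done[0]);
        System.out.println("allOf(flattened) done early: " + done[1]);
    }
}
```

In this sketch the nested variant reports completion while `destroy` is still pending, and the flattened variant does not.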

* In `PartitionReplicaLifecycleManager#stopPartitionInternal`, the code
```
return replicaMgr.stopReplica(zonePartitionId)
        .thenCompose(replicaWasStopped -> {
            afterReplicaStopAction.accept(replicaWasStopped);

            if (!replicaWasStopped) {
                return nullCompletedFuture();
            }

            replicationGroupIds.remove(zonePartitionId);

            return fireEvent(afterReplicaStoppedEvent, eventParameters);
        });
```
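lacked `thenComposeAsync`: with plain `thenCompose`, the continuation could run
on whichever thread completed the `stopReplica` future, including one that is
not allowed to write to storage. This bug led to
`[async-destroy-group-22-partition-2-task-5 is not allowed to do STORAGE_WRITE]`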
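
The threading difference between `thenCompose` and `thenComposeAsync` can be illustrated with plain JDK futures. This is a minimal sketch under stated assumptions: the class name `ComposeThreadDemo` and the thread names `destroy-thread` and `io-thread` are hypothetical, and Ignite's thread-permission check is not modeled. The continuations are registered before the source future is completed, so the non-async one runs on the completing thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ComposeThreadDemo {
    /** Returns {thread that ran the thenCompose continuation, thread that ran the thenComposeAsync continuation}. */
    static String[] continuationThreads() {
        // Hypothetical executors; the names only make the threads recognizable in the result.
        ExecutorService destroyExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "destroy-thread"));
        ExecutorService ioExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        try {
            // Stand-in for the future returned by a "stop replica" call.
            CompletableFuture<Boolean> stopReplica = new CompletableFuture<>();

            // Runs on the thread that completes stopReplica.
            CompletableFuture<String> sync = stopReplica.thenCompose(
                    stopped -> CompletableFuture.completedFuture(Thread.currentThread().getName()));

            // Always runs on the supplied executor.
            CompletableFuture<String> async = stopReplica.thenComposeAsync(
                    stopped -> CompletableFuture.completedFuture(Thread.currentThread().getName()),
                    ioExecutor);

            // Complete the source future from the "destroy" thread and wait for that task to finish.
            CompletableFuture.runAsync(() -> stopReplica.complete(true), destroyExecutor).join();

            return new String[] {sync.join(), async.join()};
        } finally {
            destroyExecutor.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] threads = continuationThreads();
        System.out.println("thenCompose ran on:      " + threads[0]);
        System.out.println("thenComposeAsync ran on: " + threads[1]);
    }
}
```

Passing an explicit executor to `thenComposeAsync` is what pins the continuation to a known thread pool, regardless of which thread completes the upstream future.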

* The code 
```
CompletableFuture<Void> operationFuture = new CompletableFuture<Void>()
        .whenComplete((v, throwable) -> ongoingOperationsById.remove(operationId))
        .orTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS);
```
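had a bug: the callback `.whenComplete((v, throwable) ->
ongoingOperationsById.remove(operationId))` was never invoked. It is attached to
the anonymous source future created by `new CompletableFuture<Void>()`, which
nothing ever completes, while `operationFuture` refers to the dependent stage
returned by the chain.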
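
The behavior can be checked in isolation. Below is a minimal sketch using only the JDK; the class name `WhenCompleteDemo`, the counters, and the 10-second timeout are hypothetical, and the "fixed" variant is just one possible way to attach the callback, not necessarily the change made in the patch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class WhenCompleteDemo {
    /** Returns {callback invocations in the buggy shape, callback invocations in the alternative shape}. */
    static int[] callbackCounts() {
        AtomicInteger buggyCalls = new AtomicInteger();
        AtomicInteger fixedCalls = new AtomicInteger();

        // Buggy shape: the callback hangs off the anonymous source future,
        // while the variable holds the dependent stage returned by whenComplete.
        CompletableFuture<Void> buggy = new CompletableFuture<Void>()
                .whenComplete((v, throwable) -> buggyCalls.incrementAndGet())
                .orTimeout(10, TimeUnit.SECONDS);

        // Completing the dependent stage does not complete its source, so the callback does not fire.
        buggy.complete(null);

        // Alternative shape (an assumption, not the actual fix): keep a reference to the
        // source future, attach the callback to it, and complete that same future.
        CompletableFuture<Void> fixed = new CompletableFuture<Void>()
                .orTimeout(10, TimeUnit.SECONDS);
        fixed.whenComplete((v, throwable) -> fixedCalls.incrementAndGet());
        fixed.complete(null);

        return new int[] {buggyCalls.get(), fixedCalls.get()};
    }

    public static void main(String[] args) {
        int[] counts = callbackCounts();
        System.out.println("buggy callback invocations: " + counts[0]);
        System.out.println("fixed callback invocations: " + counts[1]);
    }
}
```

`orTimeout` returns the future it is called on, so in the alternative shape `fixed` is the same object that the callback is registered on.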

* The wrong node was chosen for `restartPartitionsWithCleanup` in the
`testRestartPartitionsWithCleanUp` test. The future returned from
`restartPartitionsWithCleanup` can be treated as really completed only when
the method is invoked on the node that is passed to it.

> ItDisasterRecoveryControllerRestartPartitionsWithCleanupTest.testRestartTablePartitionsWithCleanupAllPartitions is flaky
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-26918
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26918
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexander Lapin
>            Assignee: Mirza Aliev
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain, ignite-3
>         Attachments: _Integration_Tests_Module_REST_23787.log.zip
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> io.micronaut.http.client.exceptions.HttpClientResponseException: {"title":"Internal Server Error","status":500,"detail":"tableId=22, partitionId=23"}
>   at app//io.micronaut.http.client.netty.DefaultHttpClient$FullHttpResponseHandler.makeErrorFromRequestBody(DefaultHttpClient.java:2232)
> ...
> Caused by: java.lang.AssertionError: tableId=22, partitionId=23
>   at org.apache.ignite.internal.table.distributed.TableManager.getPartitionStorages(TableManager.java:2762)
>  {code}
> [TC Link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleRest/9602231]
>  
> Please note that, besides the aforementioned assertion error, there is also a
> critical system error:
> {code:java}
> 18:32:45     Caused by: java.util.concurrent.CompletionException: org.apache.ignite.internal.storage.StorageDestroyedException: IGN-CMN-65535 Storage is in the process of being destroyed or already destroyed: [tableId=22, partitionId=23] TraceId:78319bfc
> 18:32:45       at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
> 18:32:45       at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
> 18:32:45       at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:791)
> 18:32:45       ... 4 more
> 18:32:45     Caused by: org.apache.ignite.internal.storage.StorageDestroyedException: Storage is in the process of being destroyed or already destroyed: [tableId=22, partitionId=23]
> 18:32:45       at org.apache.ignite.internal.storage.util.StorageUtils.throwExceptionDependingOnStorageState(StorageUtils.java:147)
> 18:32:45       at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busy(AbstractPageMemoryMvPartitionStorage.java:712)
> 18:32:45       at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.committedGroupConfiguration(PersistentPageMemoryMvPartitionStorage.java:287)
> 18:32:45       at org.apache.ignite.internal.storage.ThreadAssertingMvPartitionStorage.committedGroupConfiguration(ThreadAssertingMvPartitionStorage.java:79)
> 18:32:45       at org.apache.ignite.internal.table.distributed.raft.snapshot.SnapshotAwarePartitionDataStorage.committedGroupConfiguration(SnapshotAwarePartitionDataStorage.java:135)
> 18:32:45       at org.apache.ignite.internal.table.distributed.raft.PartitionListener.<init>(PartitionListener.java:229)
> 18:32:45       at org.apache.ignite.internal.table.distributed.TableManager.preparePartitionResourcesAndLoadToZoneReplicaBusy(TableManager.java:1076)
> 18:32:45       at org.apache.ignite.internal.table.distributed.TableManager.lambda$createPartitionsAndLoadResourcesToZoneReplica$14(TableManager.java:779)
> 18:32:45       at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:920)
> 18:32:45       at org.apache.ignite.internal.table.distributed.TableManager.lambda$createPartitionsAndLoadResourcesToZoneReplica$15(TableManager.java:763)
> 18:32:45       at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:787)
>  {code}
> [TC link|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/9601666?showLog=9601643_33483_103.1233&logFilter=debug&logView=flowAware]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
