[
https://issues.apache.org/jira/browse/IGNITE-19238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709637#comment-17709637
]
Denis Chudov commented on IGNITE-19238:
---------------------------------------
[~alapin] LGTM.
> ItDataTypesTest and ItCreateTableDdlTest are flaky
> --------------------------------------------------
>
> Key: IGNITE-19238
> URL: https://issues.apache.org/jira/browse/IGNITE-19238
> Project: Ignite
> Issue Type: Bug
> Reporter: Alexander Lapin
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Attachments: Снимок экрана от 2023-04-06 10-39-32.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> h3. Description & Root cause
> 1. ItDataTypesTest is flaky because previous ItCreateTableDdlTest tests
> failed to stop replicas on node stop:
> !Снимок экрана от 2023-04-06 10-39-32.png!
> {code:java}
> java.lang.AssertionError: There are replicas alive
> [replicas=[b86c60a8-4ea3-4592-abef-6438cfc4cdb2_part_21,
> b86c60a8-4ea3-4592-abef-6438cfc4cdb2_part_6,
> b86c60a8-4ea3-4592-abef-6438cfc4cdb2_part_13,
> b86c60a8-4ea3-4592-abef-6438cfc4cdb2_part_8,
> b86c60a8-4ea3-4592-abef-6438cfc4cdb2_part_9,
> b86c60a8-4ea3-4592-abef-6438cfc4cdb2_part_11]]
> at
> org.apache.ignite.internal.replicator.ReplicaManager.stop(ReplicaManager.java:341)
> at
> org.apache.ignite.internal.app.LifecycleManager.lambda$stopAllComponents$1(LifecycleManager.java:133)
> at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> at
> org.apache.ignite.internal.app.LifecycleManager.stopAllComponents(LifecycleManager.java:131)
> at
> org.apache.ignite.internal.app.LifecycleManager.stopNode(LifecycleManager.java:115){code}
> 2. The reason why we failed to stop replicas is the race between
> tablesToStopInCaseOfError cleanup and adding tables to tablesByIdVv.
> On TableManager stop, we stop and cleanup all table resources like replicas
> and raft nodes
> {code:java}
> public void stop() {
> ...
> Map<UUID, TableImpl> tables = tablesByIdVv.latest(); // 1*
> cleanUpTablesResources(tables);
> cleanUpTablesResources(tablesToStopInCaseOfError);
> ...
> }{code}
> where tablesToStopInCaseOfError is a sort of pending tables list which one is
> cleared on cfg storage revision update.
> tablesByIdVv *listens same storage revision update event* in order to publish
> tables related to the given revision or in other words make such tables
> accessible from tablesByIdVv.latest(); that one that is used in order to
> retrieve tables for cleanup on components stop (see // 1* above)
> {code:java}
> public TableManager(
> ...
> tablesByIdVv = new IncrementalVersionedValue<>(registry, HashMap::new);
> registry.accept(token -> {
> tablesToStopInCaseOfError.clear();
>
> return completedFuture(null);
> });
> {code}
> However inside IncrementalVersionedValue we have async storageRevision update
> processing
> {code:java}
> updaterFuture = updaterFuture.whenComplete((v, t) ->
> versionedValue.complete(causalityToken, localUpdaterFuture)); {code}
> As a result it's possible that we will clear tablesToStopInCaseOfError before
> publishing same revision tables to tablesByIdVv, so that we will miss that
> cleared tables in tablesByIdVv.latest() which is used in TableManager#stop.
> h3. Implementation Notes
> 1. First of all I've renamed tablesToStopInCaseOfError to pending tables,
> because they aren't only ...InCaseOfError.
> 2. I've also reworked tablesToStopInCaseOfError cleanup by substituting
> tablesToStopInCaseOfError.clear on revision change with
> {code:java}
> tablesByIdVv.get(causalityToken).thenAccept(ignored -> inBusyLock(busyLock,
> ()-> {
> pendingTables.remove(tblId);
> })); {code}
> meaning that we
> 2.1. remove specific table by id instead of ready.
> 2.2. do that removal on corresponding table publishing wihtin tablesByIdVv.
> 3. That means that at some point right after the publishing but before
> removal it's possible to have same table both within tablesByIdVv and
> pendingTables thus in order not to stop same table twice (which is safe by
> the way because of idempotentce) I've substituted
> {code:java}
> cleanUpTablesResources(tables);
> cleanUpTablesResources(tablesToStopInCaseOfError); {code}
> with
> {code:java}
> Map<UUID, TableImpl> tablesToStop =
> Stream.concat(tablesByIdVv.latest().entrySet().stream(),
> pendingTables.entrySet().stream()).
> collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (v1,
> v2) -> v1));
> cleanUpTablesResources(tablesToStop); {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)