[
https://issues.apache.org/jira/browse/IGNITE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin reassigned IGNITE-20053:
----------------------------------------
Assignee: Alexander Lapin (was: Denis Chudov)
> Empty data nodes are returned by data nodes engine
> --------------------------------------------------
>
> Key: IGNITE-20053
> URL: https://issues.apache.org/jira/browse/IGNITE-20053
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Chudov
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY
> and it is refreshed by topology listener on topology events and stores
> logical topology. If the value stored by this key is null, then empty data
> nodes are returned from data nodes engine on data nodes calculation for a
> distribution zone. As a result, empty assignments are calculated for
> partitions, which leads to exception described in IGNITE-19466.
> Some integration tests, for example, ItRebalanceDistributedTest are flaky
> because of possible problems with value of
> DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and empty data nodes calculated by
> data nodes engine.
> Actually, the empty data nodes collection is a wrong result for this case
> because the current logical topology is not empty.
> h3. UPD #1
> *1.* The reason for empty data nodes assertion is race between join
> completion and thus firing logical topology updates and DZM start. Literally,
> it's required to put
> {code:java}
> nodes.stream().forEach(Node::waitWatches); {code}
> before
> {code:java}
> assertThat(
> allOf(nodes.get(0).cmgManager.onJoinReady(),
> nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
> willCompleteSuccessfully()
> ); {code}
> in
> org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.
>
> *2.* However, that's not the whole story. We also faced
> {code:java}
> Unable to initialize the cluster: null{code}
> because cmg init failed with TimeoutException because we start CMGManager
> asynchronously, which is incorrect. So if we move cmgManager to
> firstComponents that will solve the issue.
> {code:java}
> List<IgniteComponent> firstComponents = List.of(
> vaultManager,
> nodeCfgMgr,
> clusterService,
> raftManager,
> cmgManager
> ); {code}
>
> *3.* Still it's not the whole story. testTwoQueuedRebalances failed because
> we didn't retrieved an expected stable assignments after table creation
> {code:java}
> await(nodes.get(0).tableManager.createTableAsync(
> "TBL1",
> ZONE_1_NAME,
> tblChanger -> SchemaConfigurationConverter.convert(schTbl1,
> tblChanger)
> ));
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> The reason for that is that assignments calculation is an async process, so
> there are no guarantees that we will retrieve proper assignments right after
> table creation completes. So we might substitute
> {code:java}
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> with
> {code:java}
> assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1,
> 10_000));{code}
> Please pay attention that there are multiple places where we retrieve
> assignments and expect them to be ready.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)