[
https://issues.apache.org/jira/browse/IGNITE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin updated IGNITE-20053:
-------------------------------------
Description:
The meta storage key DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY stores the logical
topology and is refreshed by the topology listener on topology events. If the
value stored under this key is null, the data nodes engine returns empty data
nodes when it calculates data nodes for a distribution zone. As a result, empty
assignments are calculated for partitions, which leads to the exception
described in IGNITE-19466.
Some integration tests, for example ItRebalanceDistributedTest, are flaky
because of possible problems with the value of
DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and the empty data nodes calculated by
the data nodes engine.
An empty data nodes collection is the wrong result in this case because the
current logical topology is not empty.
h3. UPD #1
*1.* The empty data nodes assertion is caused by a race between join
completion, which fires logical topology updates, and DZM start. The fix is to
put
{code:java}
nodes.stream().forEach(Node::waitWatches); {code}
before
{code:java}
assertThat(
        allOf(
                nodes.get(0).cmgManager.onJoinReady(),
                nodes.get(1).cmgManager.onJoinReady(),
                nodes.get(2).cmgManager.onJoinReady()
        ),
        willCompleteSuccessfully()
); {code}
in
org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.
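The ordering problem in item 1 can be illustrated with a toy model. This is only a sketch under assumptions: TopologyEvents and storedTopology are hypothetical names, not Ignite APIs, and the model reduces the race to "a listener registered after the event fires never sees it", which is one plausible reading of why the stored value ends up null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Toy model of the race; every name here is hypothetical, not an Ignite API. */
final class WatchRaceSketch {
    /** Delivers an event only to listeners that are already registered when it fires. */
    static final class TopologyEvents {
        private final List<Consumer<List<String>>> listeners = new ArrayList<>();

        void listen(Consumer<List<String>> listener) {
            listeners.add(listener);
        }

        void fire(List<String> topology) {
            listeners.forEach(listener -> listener.accept(topology));
        }
    }

    /** Returns the topology the listener stored, or null if it missed the event. */
    static List<String> storedTopology(boolean watchesFirst) {
        TopologyEvents events = new TopologyEvents();
        AtomicReference<List<String>> stored = new AtomicReference<>();

        if (watchesFirst) {
            events.listen(stored::set); // watches deployed before the join completes
        }

        events.fire(List.of("node0", "node1", "node2")); // join completes

        if (!watchesFirst) {
            events.listen(stored::set); // too late: the update was already fired
        }

        return stored.get();
    }

    public static void main(String[] args) {
        System.out.println(storedTopology(true));  // [node0, node1, node2]
        System.out.println(storedTopology(false)); // null
    }
}
```

In the model, waiting for watches first corresponds to storedTopology(true); the flaky runs correspond to storedTopology(false).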
*2.* However, that's not the whole story. We also faced
{code:java}
Unable to initialize the cluster: null{code}
because CMG init failed with a TimeoutException. That happens because we start
CMGManager asynchronously, which is incorrect. Moving cmgManager to
firstComponents will solve the issue.
{code:java}
List<IgniteComponent> firstComponents = List.of(
        vaultManager,
        nodeCfgMgr,
        clusterService,
        raftManager,
        cmgManager
); {code}
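A rough sketch of why the start order matters in item 2. ToyCmg and initAfterStart are hypothetical stand-ins, not the real CMGManager: the assumption is only that init waits a bounded time for the component to be started, so a start deferred to another thread can lose against the init timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of start ordering; every name here is hypothetical, not an Ignite API. */
final class StartOrderSketch {
    /** Stand-in for a component whose init needs start() to have run already. */
    static final class ToyCmg {
        private final CompletableFuture<Void> started = new CompletableFuture<>();

        void start() {
            started.complete(null);
        }

        /** Returns true if the component was started within the timeout. */
        boolean init(long timeoutMillis) {
            try {
                started.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // models the failed cluster initialization
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } catch (ExecutionException e) {
                return false;
            }
        }
    }

    /** Starts the component synchronously, or after a delay on another thread, then calls init. */
    static boolean initAfterStart(boolean synchronous, long startDelayMillis, long initTimeoutMillis) {
        ToyCmg cmg = new ToyCmg();

        if (synchronous) {
            cmg.start(); // like being listed in firstComponents
        } else {
            CompletableFuture.runAsync(
                    cmg::start,
                    CompletableFuture.delayedExecutor(startDelayMillis, TimeUnit.MILLISECONDS)
            );
        }

        return cmg.init(initTimeoutMillis);
    }

    public static void main(String[] args) {
        System.out.println(initAfterStart(true, 0, 50));    // true
        System.out.println(initAfterStart(false, 500, 50)); // false
    }
}
```

The delays are arbitrary; they are chosen so that the deferred start always loses in the model.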
*3.* That is still not the whole story. testTwoQueuedRebalances failed because
we didn't retrieve the expected stable assignments after table creation
{code:java}
await(nodes.get(0).tableManager.createTableAsync(
        "TBL1",
        ZONE_1_NAME,
        tblChanger -> SchemaConfigurationConverter.convert(schTbl1, tblChanger)
));
assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
The reason is that assignments calculation is asynchronous, so there is no
guarantee that we will retrieve the proper assignments right after table
creation completes. So we might replace
{code:java}
assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
with
{code:java}
assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, 1_000));{code}
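For reference, a minimal sketch of the polling pattern behind the substitution in item 3. The waitForCondition below is a hypothetical stand-in written for this example, not the test utility used in the Ignite code base, and the 10 ms polling interval is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; not the Ignite test utility. */
final class WaitSketch {
    /** Polls cond roughly every 10 ms until it holds or timeoutMillis elapses. */
    static boolean waitForCondition(BooleanSupplier cond, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;

        while (System.nanoTime() - deadline < 0) {
            if (cond.getAsBoolean()) {
                return true;
            }

            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        return cond.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger assignedNodes = new AtomicInteger();

        // Simulates assignments being calculated some time after table creation returns.
        Thread calculation = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            assignedNodes.set(1);
        });
        calculation.start();

        // An immediate assertEquals(1, ...) here could observe 0; polling tolerates the delay.
        System.out.println(waitForCondition(() -> assignedNodes.get() == 1, 1_000));

        calculation.join();
    }
}
```

The trade-off is that a genuinely wrong assignment now fails only after the full timeout instead of immediately.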
was:
There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and
it is refreshed by topology listener on topology events and stores logical
topology. If the value stored by this key is null, then empty data nodes are
returned from data nodes engine on data nodes calculation for a distribution
zone. As a result, empty assignments are calculated for partitions, which leads
to exception described in IGNITE-19466.
Some integration tests, for example, ItRebalanceDistributedTest are flaky
because of possible problems with value of
DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and empty data nodes calculated by data
nodes engine.
Actually, the empty data nodes collection is a wrong result for this case
because the current logical topology is not empty.
h3. UPD #1
The reason for empty data nodes assertion is race between join completion and
thus firing logical topology updates and DZM start. Literally, it's required to
put
{code:java}
nodes.stream().forEach(Node::waitWatches); {code}
before
{code:java}
assertThat(
allOf(nodes.get(0).cmgManager.onJoinReady(),
nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
willCompleteSuccessfully()
); {code}
in
org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.
However, that's not the whole story. We also faced
{code:java}
Unable to initialize the cluster: null{code}
because cmg init failed with TimeoutException because we start CMGManager
asynchronously, which is incorrect. So if we move cmgManager to firstComponents
that will solve the issue.
{code:java}
List<IgniteComponent> firstComponents = List.of(
vaultManager,
nodeCfgMgr,
clusterService,
raftManager,
cmgManager
); {code}
> Empty data nodes are returned by data nodes engine
> --------------------------------------------------
>
> Key: IGNITE-20053
> URL: https://issues.apache.org/jira/browse/IGNITE-20053
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Chudov
> Assignee: Denis Chudov
> Priority: Major
> Labels: ignite-3