[
https://issues.apache.org/jira/browse/IGNITE-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vyacheslav Koptilin reassigned IGNITE-16718:
--------------------------------------------
Assignee: Alexander Lapin
> ItIgniteNodeRestartTest#testCfgGap is flaky
> -------------------------------------------
>
> Key: IGNITE-16718
> URL: https://issues.apache.org/jira/browse/IGNITE-16718
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Chudov
> Assignee: Alexander Lapin
> Priority: Blocker
> Labels: ignite-3
>
> ItIgniteNodeRestartTest#testCfgGap can be found in the ignite-16362 branch.
> The test fails because a null value is returned for a previously upserted key.
> With the following slightly simplified test (one table instead of two, and one
> insertion instead of one hundred)
> {code:java}
> public void testCfgGap(TestInfo testInfo) {
>     final int nodes = 4;
>     for (int i = 0; i < nodes; i++) {
>         startNode(testInfo, i);
>     }
>     createTableWithData(CLUSTER_NODES.get(0), "t1", nodes);
>     String igniteName = CLUSTER_NODES.get(nodes - 1).name();
>     log.info("Stopping the node.");
>     IgnitionManager.stop(igniteName);
>     checkTableWithData(CLUSTER_NODES.get(0), "t1");
>     log.info("Starting the node.");
>     Ignite newNode = IgnitionManager.start(igniteName, null,
>         workDir.resolve(igniteName));
>     CLUSTER_NODES.set(nodes - 1, newNode);
>     checkTableWithData(CLUSTER_NODES.get(0), "t1");
>     checkTableWithData(CLUSTER_NODES.get(nodes - 1), "t1");
> }
> private void checkTableWithData(Ignite ignite, String name) {
>     ...
>     for (int i = 0; i < 1; i++) {
>         ...
>     }
> }
> private void createTableWithData(Ignite ignite, String name, int replicas) {
>     ...
>     for (int i = 0; i < 1; i++) {
>         ...
>     }
> }{code}
> an inconsistent read is reproduced under the following flow:
> # table.keyValueView.put(k1)
> ## PartitionListener#handleUpsertCommand on Node B
> ## PartitionListener#handleUpsertCommand on Node C
> ## PartitionListener#handleUpsertCommand on Node D
> ## Note that the upsert command was not handled on Node A. That is
> fine, because B, C, and D form a majority.
> # node D stop
> # nodeA.table.keyValueView().get(k1)
> ## PartitionListener#handleGetCommand on Node B // Means that B is the leader.
> # node D start
> ## PartitionListener#handleUpsertCommand on Node D // Inner raft rebalance
> # nodeA.table.keyValueView().get(k1)
> ## PartitionListener#handleGetCommand on Node B // Means that B is still the
> leader.
> # nodeD.table.keyValueView().get(k1)
> ## PartitionListener#handleGetCommand on Node *A* // Means that the leader
> changed to A and, crucially, the upsert command was never handled
> on Node A.
> I've checked this by adding
> {code:java}
> private void handleUpsertCommand(UpsertCommand cmd) {
>     System.out.println(">>> Upserted" +
>         ((TxManagerImpl)txManager).clusterService.topologyService().localMember());
>     ...
> } {code}
> and
> {code:java}
> private SingleRowResponse handleGetCommand(GetCommand cmd) {
>     System.out.println(">>> Get" +
>         ((TxManagerImpl)txManager).clusterService.topologyService().localMember());
>     ...
> } {code}
>
> Further investigation items might be:
> * Checking whether the k1 upsert was committed on node A. Committing an entry
> and applying it to the state machine are different steps, and according
> to Raft a node that is missing committed entries must not become the leader.
> * Checking why the leader changed between reads.
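
The first investigation item rests on Raft's election restriction: a voter grants its vote only to a candidate whose log is at least as up to date as its own. A minimal, self-contained sketch of that check, applied to the scenario in this issue, might look like the following. It is a toy model, not Ignite or JRaft code: the class RaftVoteSketch, its helpers, and the concrete terms and indexes (the upsert at index 2 in term 1, all nodes reachable) are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of Raft's vote-granting rule; not Ignite or JRaft code. */
class RaftVoteSketch {
    /** Term and index of the last entry in a node's log. */
    static final class LogPosition {
        final long lastTerm;
        final long lastIndex;

        LogPosition(long lastTerm, long lastIndex) {
            this.lastTerm = lastTerm;
            this.lastIndex = lastIndex;
        }
    }

    /** True if the candidate's log is at least as up to date as the voter's. */
    static boolean isUpToDate(LogPosition candidate, LogPosition voter) {
        if (candidate.lastTerm != voter.lastTerm) {
            return candidate.lastTerm > voter.lastTerm;
        }
        return candidate.lastIndex >= voter.lastIndex;
    }

    /** Counts the votes a candidate would collect, including its own. */
    static int countVotes(LogPosition candidate, List<LogPosition> others) {
        int votes = 1; // a candidate always votes for itself
        for (LogPosition voter : others) {
            if (isUpToDate(candidate, voter)) {
                votes++;
            }
        }
        return votes;
    }

    /** A strict majority of the cluster is required to win an election. */
    static boolean hasMajority(int votes, int clusterSize) {
        return votes > clusterSize / 2;
    }

    public static void main(String[] args) {
        // Assumed state: the upsert is log entry 2; Node A has only entry 1.
        LogPosition a = new LogPosition(1, 1);
        LogPosition b = new LogPosition(1, 2);
        LogPosition c = new LogPosition(1, 2);
        LogPosition d = new LogPosition(1, 2); // caught up after restart

        int votesForA = countVotes(a, Arrays.asList(b, c, d));
        int votesForB = countVotes(b, Arrays.asList(a, c, d));

        System.out.println("A wins: " + hasMajority(votesForA, 4)); // A wins: false
        System.out.println("B wins: " + hasMajority(votesForB, 4)); // B wins: true
    }
}
```

Under these assumptions Node A cannot collect a majority while B, C, and D all hold the upsert entry. If A did become the leader in the failing run, the actual log state on A at election time is likely worth inspecting first.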