[
https://issues.apache.org/jira/browse/IGNITE-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556324#comment-17556324
]
Vyacheslav Koptilin commented on IGNITE-15655:
----------------------------------------------
Looks like we can unmute SQL tests based on test branch (which includes all SQL
related fixes)
https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleRunner?branch=%3Cdefault%3E&mode=builds
> Random tests fail due to TimeoutException while waiting for a response
> ----------------------------------------------------------------------
>
> Key: IGNITE-15655
> URL: https://issues.apache.org/jira/browse/IGNITE-15655
> Project: Ignite
> Issue Type: Bug
> Reporter: Konstantin Orlov
> Assignee: Vyacheslav Koptilin
> Priority: Blocker
> Labels: ignite-3
> Attachments: _Integration_Tests_Integration_Tests_3305.log.zip,
> _Integration_Tests_Integration_Tests_3322.log.zip
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h3. Problem
> Sometimes tests fails on main branch with an abstract "{{Caused by:
> java.util.concurrent.TimeoutException: Did not observe any item or terminal
> signal within 1000ms in 'source(MonoDefer)' (and no fallback has been
> configured)"}} reason.
> Need to investigate and fix the root cause of this fails.
>
> Test
> {{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testRenameColumn()}}
> in [^_Integration_Tests_Integration_Tests_3305.log.zip]
> Test
> {{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testMergeChangesAddDropAdd()}}
> in [^_Integration_Tests_Integration_Tests_3322.log.zip]
> h3. Upd 1
> Quick research revealed that the main reason for mentioned above
> _TimeoutException_ is a poor consistency of raft timeouts .
> Most of jraft timeouts are based on electionTimeoutMs.
> {code:java}
> // A follower would become a candidate if it doesn't receive any message
> // from the leader in |election_timeout_ms| milliseconds
> // Default: 1000 (1s)
> private int electionTimeoutMs = 1000; // follower to candidate timeout
> {code}
> For example both voteTime and electionTime use exact value of
> getElectionTimeoutMs (1000 ms):
> {code:java}
> String name = "JRaft-VoteTimer-" + suffix;
> this.voteTimer = new RepeatedTimer(name,
> options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};
> name = "JRaft-ElectionTimer-" + suffix;
> electionTimer = new RepeatedTimer(name,
> options.getElectionTimeoutMs(), timerFactory.getElectionTimer(name)) {...};
> {code}
> It actually means that Loza#TIMEOUT that was also initially set to 1000 ms
> just won't be enough in some cases. Seems that "operation timeout
> (LOZA#TIMEOUT)" should be an order of magnitude more than election timeout.
> During the implementation of [naive data
> rebalance|https://issues.apache.org/jira/browse/IGNITE-15491] LOZA#TIMEOUT
> was set to 10_000 to satisfy this need. Also a
> [ticket|https://issues.apache.org/jira/browse/IGNITE-15705] for timeouts
> research and consolidation was created.
> So, taking into consideration that raft operation timeout was already
> increased from 1 second to 10 it worth to just enable all tests that were
> disabled under IGNITE-15655 and check whether they are still failing.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)