[ https://issues.apache.org/jira/browse/HDDS-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265154#comment-17265154 ]
Attila Doroszlai commented on HDDS-4668:
----------------------------------------
{code:title=https://github.com/apache/ozone/blob/159b0c61c3264c9c3c3e1e6e94ef853e31138557/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java#L154-L173}
154 │ // Get the latest db checkpoint from the leader OM.
155 │ OMTransactionInfo omTransactionInfo =
156 │ OMTransactionInfo.readTransactionInfo(leaderOM.getMetadataManager());
157 │ TermIndex leaderOMTermIndex =
158 │ TermIndex.valueOf(omTransactionInfo.getTerm(),
159 │ omTransactionInfo.getTransactionIndex());
160 │ long leaderOMSnaphsotIndex = leaderOMTermIndex.getIndex();
161 │ long leaderOMSnapshotTermIndex = leaderOMTermIndex.getTerm();
162 │
163 │ DBCheckpoint leaderDbCheckpoint =
164 │ leaderOM.getMetadataManager().getStore().getCheckpoint(false);
165 │
166 │ // Start the inactive OM
167 │ cluster.startInactiveOM(followerNodeId);
168 │
169 │ // The recently started OM should be lagging behind the leader OM.
170 │ long followerOMLastAppliedIndex =
171 │ followerOM.getOmRatisServer().getLastAppliedTermIndex().getIndex();
172 │ assertTrue(
173 │ followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
{code}
[~hanishakoneru], the test fails here if the follower OM is _not_ lagging behind
the leader OM. The follower is brought up to date in the background by
{{OMDoubleBufferFlushThread}}, so whether the assertion passes depends on
timing. The failure can be reproduced consistently by adding a sleep of a few
seconds after the {{cluster.startInactiveOM}} call.
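The timing dependence can be illustrated outside Ozone with a minimal, self-contained sketch. It does not use any Ozone or Ratis classes: the class {{LaggingAssertionRace}}, the methods {{isLagging}} and {{observeLag}}, and the index value 200 are all hypothetical stand-ins, and a latch replaces the real background flush so the two orderings are deterministic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not Ozone code. */
public class LaggingAssertionRace {

  /** Same shape as the failing assertion: follower index < leader index. */
  static boolean isLagging(AtomicLong followerIndex, long leaderIndex) {
    return followerIndex.get() < leaderIndex;
  }

  /**
   * Simulates a follower that is caught up by a background thread.
   * The check runs either before the background thread is released
   * or after it has finished, and the result differs accordingly.
   */
  static boolean observeLag(boolean checkBeforeCatchUp) {
    final long leaderIndex = 200; // assumed value, for illustration
    AtomicLong followerIndex = new AtomicLong(0);
    CountDownLatch release = new CountDownLatch(1);

    // Stand-in for the background work that brings the follower up to date.
    Thread catchUp = new Thread(() -> {
      try {
        release.await();
        followerIndex.set(leaderIndex);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    catchUp.start();

    try {
      if (checkBeforeCatchUp) {
        boolean lagging = isLagging(followerIndex, leaderIndex);
        release.countDown();
        catchUp.join();
        return lagging;
      }
      // Equivalent to sleeping after the start call: catch-up wins the race.
      release.countDown();
      catchUp.join();
      return isLagging(followerIndex, leaderIndex);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("check before catch-up: lagging=" + observeLag(true));
    System.out.println("check after catch-up:  lagging=" + observeLag(false));
  }
}
```

In the real test nothing enforces either ordering, which is why the same assertion can pass or fail from run to run.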
The simplest way to fix the test is to remove the assertion about the follower
index before installCheckpoint (lines 172-173). Do you have any suggestions on
how to fix the test while keeping some kind of assertion about the follower
state?
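One possible direction, offered only as a sketch and not as the agreed fix, is to assert a condition that becomes true eventually instead of one that must hold at a single instant, for example that the follower reaches the leader's snapshot index within a timeout. The class {{EventualIndexAssertion}} and its methods below are hypothetical and self-contained; in the actual test a polling utility such as Hadoop's {{GenericTestUtils.waitFor}} could play the role of the helper.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration only; not Ozone code. */
public class EventualIndexAssertion {

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns true if the condition became true in time, false otherwise.
   */
  static boolean waitFor(BooleanSupplier condition, long intervalMs,
      long timeoutMs) {
    final long start = System.nanoTime();
    final long timeoutNanos = timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - start >= timeoutNanos) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  /**
   * Simulates a follower that catches up after catchUpDelayMs and checks
   * that it reaches leaderIndex within timeoutMs, regardless of timing.
   */
  static boolean followerCatchesUp(long leaderIndex, long catchUpDelayMs,
      long timeoutMs) {
    AtomicLong followerIndex = new AtomicLong(0);
    Thread catchUp = new Thread(() -> {
      try {
        Thread.sleep(catchUpDelayMs);
        followerIndex.set(leaderIndex);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    catchUp.setDaemon(true);
    catchUp.start();
    return waitFor(() -> followerIndex.get() >= leaderIndex, 10, timeoutMs);
  }

  public static void main(String[] args) {
    System.out.println("caught up: " + followerCatchesUp(200, 50, 5000));
  }
}
```

Such a check passes whether the follower catches up early or late, but it no longer verifies that the follower was behind before the checkpoint was installed, so it answers only part of the question.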
(Side note: the leader index is taken from the transaction info (lines 155-156),
while the follower index is the last applied termIndex (lines 170-171), and
these two seem to differ by 1 for both leader and follower (199 vs. 200 in the
test).)
> Intermittent failure in TestOMRatisSnapshots
> --------------------------------------------
>
> Key: HDDS-4668
> URL: https://issues.apache.org/jira/browse/HDDS-4668
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: test
> Affects Versions: 1.1.0
> Reporter: Attila Doroszlai
> Priority: Major
>
> {code:title=https://github.com/elek/ozone-build-results/blob/733562704a1beeaeb175d010d9e4f86c2f8fd23b/2021/01/09/5074/it-ozone/output.log#L123-L129}
> [ERROR] Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 178.509 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestOMRatisSnapshots
> [ERROR] testInstallSnapshot(org.apache.hadoop.ozone.om.TestOMRatisSnapshots) Time elapsed: 66.121 s <<< FAILURE!
> java.lang.AssertionError
> ...
> at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(TestOMRatisSnapshots.java:172)
> {code}