[jira] [Commented] (HDDS-4668) Intermittent failure in TestOMRatisSnapshots

Attila Doroszlai (Jira) Thu, 14 Jan 2021 11:30:06 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265154#comment-17265154
 ]


Attila Doroszlai commented on HDDS-4668:
----------------------------------------

{code:title=https://github.com/apache/ozone/blob/159b0c61c3264c9c3c3e1e6e94ef853e31138557/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java#L154-L173}
 154   │     // Get the latest db checkpoint from the leader OM.
 155   │     OMTransactionInfo omTransactionInfo =
 156   │         OMTransactionInfo.readTransactionInfo(leaderOM.getMetadataManager());
 157   │     TermIndex leaderOMTermIndex =
 158   │         TermIndex.valueOf(omTransactionInfo.getTerm(),
 159   │             omTransactionInfo.getTransactionIndex());
 160   │     long leaderOMSnaphsotIndex = leaderOMTermIndex.getIndex();
 161   │     long leaderOMSnapshotTermIndex = leaderOMTermIndex.getTerm();
 162   │
 163   │     DBCheckpoint leaderDbCheckpoint =
 164   │         leaderOM.getMetadataManager().getStore().getCheckpoint(false);
 165   │
 166   │     // Start the inactive OM
 167   │     cluster.startInactiveOM(followerNodeId);
 168   │
 169   │     // The recently started OM should be lagging behind the leader OM.
 170   │     long followerOMLastAppliedIndex =
 171   │         followerOM.getOmRatisServer().getLastAppliedTermIndex().getIndex();
 172   │     assertTrue(
 173   │         followerOMLastAppliedIndex < leaderOMSnaphsotIndex);
{code}

[~hanishakoneru], the test fails here if the follower OM is _not_ lagging 
behind the leader OM.  The follower is brought up to date in the background by 
{{OMDoubleBufferFlushThread}}, so whether the test passes or fails depends on 
timing.  The failure can be reproduced consistently by adding a sleep of a few 
seconds after the {{cluster.startInactiveOM}} call.
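
To illustrate the timing dependency in isolation, here is a minimal, self-contained sketch. It does not use any Ozone classes: {{FollowerLagRace}}, {{followerAppearsLagging}} and the latch are hypothetical stand-ins, and the background thread only imitates a follower being caught up asynchronously. A latch forces each of the two possible orderings so that both outcomes can be seen deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class FollowerLagRace {

  /**
   * Evaluates "follower index < leader index" for one of the two possible
   * orderings of the check and the background catch-up. All names here are
   * hypothetical; nothing in this sketch is Ozone or Ratis API.
   */
  public static boolean followerAppearsLagging(long leaderIndex,
      boolean checkAfterCatchUp) {
    AtomicLong followerIndex = new AtomicLong(0);
    CountDownLatch mayCatchUp = new CountDownLatch(1);

    // Stand-in for the background work that brings the follower up to date.
    Thread catchUp = new Thread(() -> {
      try {
        mayCatchUp.await();
        followerIndex.set(leaderIndex);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    catchUp.start();

    boolean lagging;
    try {
      if (checkAfterCatchUp) {
        // Catch-up finishes first, as happens when a sleep is added.
        mayCatchUp.countDown();
        catchUp.join();
        lagging = followerIndex.get() < leaderIndex;
      } else {
        // The check runs first, which is the ordering the test relies on.
        lagging = followerIndex.get() < leaderIndex;
        mayCatchUp.countDown();
        catchUp.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return lagging;
  }

  public static void main(String[] args) {
    System.out.println(followerAppearsLagging(199, false)); // prints "true"
    System.out.println(followerAppearsLagging(199, true));  // prints "false"
  }
}
```

In the real test nothing forces either ordering, so the same assertion can observe either result.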

The simplest way to fix the test is to remove the assertion about the follower 
index before installCheckpoint (lines 172-173).  Do you have any suggestions 
for fixing the test while keeping some kind of assertion about follower state?
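
One general shape for a timing-independent check, offered only as a sketch and not as a proposed patch, is to assert a condition that eventually becomes true rather than one that holds only briefly. The example below is self-contained and uses no Ozone classes: {{EventualAssertionSketch}}, {{waitFor}} and the simulated follower index are hypothetical, and whether "follower has reached the leader index" is the right condition for this test is an open assumption.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

public class EventualAssertionSketch {

  /**
   * Polls the condition until it holds or the timeout expires.
   * Hypothetical helper; returns true if the condition became true in time.
   */
  public static boolean waitFor(BooleanSupplier condition, long pollMillis,
      long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long leaderIndex = 199;
    AtomicLong followerIndex = new AtomicLong(0);

    // Stand-in for the follower being caught up in the background.
    Thread catchUp = new Thread(() -> {
      try {
        Thread.sleep(200);
        followerIndex.set(leaderIndex);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    catchUp.start();

    // Passes regardless of how long the catch-up takes, up to the timeout.
    boolean caughtUp =
        waitFor(() -> followerIndex.get() >= leaderIndex, 50, 5000);
    if (!caughtUp) {
      throw new AssertionError("follower did not catch up in time");
    }
    System.out.println("follower index: " + followerIndex.get());
  }
}
```

The trade-off is that such a check no longer says anything about the follower lagging at a particular moment, only about the state it converges to.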

(Sidenote: the leader index is taken from transaction info (lines 155-156), 
while the follower index is the last applied termIndex (lines 170-171), and 
these two seem to differ by 1 for both leader and follower (199 vs. 200 in the 
test).)

> Intermittent failure in TestOMRatisSnapshots
> --------------------------------------------
>
>                 Key: HDDS-4668
>                 URL: https://issues.apache.org/jira/browse/HDDS-4668
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: Attila Doroszlai
>            Priority: Major
>
> {code:title=https://github.com/elek/ozone-build-results/blob/733562704a1beeaeb175d010d9e4f86c2f8fd23b/2021/01/09/5074/it-ozone/output.log#L123-L129}
> [ERROR] Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 178.509 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestOMRatisSnapshots
> [ERROR] testInstallSnapshot(org.apache.hadoop.ozone.om.TestOMRatisSnapshots)  Time elapsed: 66.121 s  <<< FAILURE!
> java.lang.AssertionError
>   ...
>   at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.testInstallSnapshot(TestOMRatisSnapshots.java:172)
> {code}



