[ https://issues.apache.org/jira/browse/HDFS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111693#comment-15111693 ]
Mingliang Liu commented on HDFS-9672:
-------------------------------------
The test overrides the lease period (which should not happen in practice, as
the lease expiry period is otherwise immutable), which triggers the Monitor in
{{LeaseManager}} to perform lease recovery (via {{LeaseManager#checkLeases}})
and an edit log sync (via {{FSEditLog#logSync()}}). The test then sleeps for
2 seconds before checking that the lease holder is now the NN. After this, the
test simply restarts the NN, which closes the edit log. See the following
code:
{code:title=org.apache.hadoop.hdfs.TestLeaseRecovery2.hardLeaseRecoveryRestartHelper.java}
489 // set the hard limit to be 1 second
490 cluster.setLeasePeriod(LONG_LEASE_PERIOD, SHORT_LEASE_PERIOD);
491
492 // Make sure lease recovery begins.
493 Thread.sleep(HdfsServerConstants.NAMENODE_LEASE_RECHECK_INTERVAL * 2);
494
495 checkLease(fileStr, size);
496
497 cluster.restartNameNode(false);
{code}
There are two problems here.
1. If lease recovery has not begun by the end of the main thread's 2s sleep,
the {{checkLease(fileStr, size);}} call at line 495 fails. This happens
intermittently.
2. There is a data race between stopping the NN and {{LeaseManager$Monitor}}
flushing the edit log. If the flush fails, the journal set is disabled. When
the NN then closes the edit log and calls {{FSEditLog#logSync()}}, there are
not enough journals left to persist the edits, because all of them have been
disabled.
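For the first problem, one possible direction is to poll for the expected state with a timeout instead of relying on a fixed sleep. The following is only a minimal, self-contained sketch of that idea, not the actual fix: the class {{LeaseWaitSketch}}, the helper {{waitFor}}, and the {{AtomicBoolean}} standing in for "the NN is now the lease holder" are all hypothetical names used for illustration.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class LeaseWaitSketch {

  /**
   * Polls the condition every intervalMs milliseconds until it holds,
   * or throws TimeoutException once timeoutMs milliseconds have elapsed.
   * Hypothetical helper, not part of the test under discussion.
   */
  public static void waitFor(BooleanSupplier condition, long intervalMs,
      long timeoutMs) throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for "lease recovery has begun and the NN holds the lease".
    AtomicBoolean recovered = new AtomicBoolean(false);

    // Stand-in for the background monitor; it finishes at an
    // unpredictable time from the main thread's point of view.
    Thread monitor = new Thread(() -> {
      try {
        Thread.sleep(300);
      } catch (InterruptedException e) {
        return;
      }
      recovered.set(true);
    });
    monitor.start();

    // Wait until the state is observed, rather than sleeping a fixed time.
    waitFor(recovered::get, 50, 5000);
    monitor.join();
    System.out.println("lease recovery observed: " + recovered.get());
  }
}
```

In the real test, the condition would presumably be a check on the current lease holder rather than a flag, and the timeout would have to be generous enough for slow build machines.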
> o.a.h.hdfs.TestLeaseRecovery2 fails intermittently
> --------------------------------------------------
>
> Key: HDFS-9672
> URL: https://issues.apache.org/jira/browse/HDFS-9672
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Affects Versions: 3.0.0
> Reporter: Mingliang Liu
> Assignee: Mingliang Liu
>
> It fails in recent builds, see:
> https://builds.apache.org/job/PreCommit-HDFS-Build/14177/testReport/org.apache.hadoop.hdfs/
> https://builds.apache.org/job/PreCommit-HDFS-Build/14147/testReport/org.apache.hadoop.hdfs/
> Failing test methods include:
> * org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecoveryWithRenameAfterNameNodeRestart
> * org.apache.hadoop.hdfs.TestLeaseRecovery2.testLeaseRecoverByAnotherUser
> * org.apache.hadoop.hdfs.TestLeaseRecovery2.testHardLeaseRecovery
> * org.apache.hadoop.hdfs.TestLeaseRecovery2.org.apache.hadoop.hdfs.TestLeaseRecovery2