[
https://issues.apache.org/jira/browse/HDFS-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369617#comment-15369617
]
Yiqun Lin commented on HDFS-10603:
----------------------------------
Thanks [~yzhangal] for reporting this. I tested this unit test many times in my
local env. Every time the test failed, I found it always costed some time in
processReport, the output log like these:
{code}
2016-07-10 20:04:56,585 [Block report processor] INFO
blockmanagement.BlockManager (BlockManager.java:processReport(2178)) -
Processing first storage report for DS-7cd4704a-096b-4f6d-aa4f-6f7f74798e53
from datanode 154b2810-2bb9-4ec5-bdb3-c8d38988adae
2016-07-10 20:04:56,586 [Block report processor] INFO BlockStateChange
(BlockManager.java:processReport(2229)) - BLOCK* processReport: from storage
DS-7cd4704a-096b-4f6d-aa4f-6f7f74798e53 node
DatanodeRegistration(127.0.0.1:49884,
datanodeUuid=154b2810-2bb9-4ec5-bdb3-c8d38988adae, infoPort=49886,
infoSecurePort=0, ipcPort=49887,
storageInfo=lv=-57;cid=testClusterID;nsid=653559418;c=1468152288554), blocks:
2, hasStaleStorage: true, processing time: 1 msecs, invalidatedBlocks: 0
2016-07-10 20:04:56,586 [Block report processor] INFO
blockmanagement.BlockManager (BlockManager.java:processReport(2178)) -
Processing first storage report for DS-3917a777-191f-448b-be88-2396c5308ab6
from datanode 154b2810-2bb9-4ec5-bdb3-c8d38988adae
2016-07-10 20:04:56,586 [Block report processor] INFO BlockStateChange
(BlockManager.java:processReport(2229)) - BLOCK* processReport: from storage
DS-3917a777-191f-448b-be88-2396c5308ab6 node
DatanodeRegistration(127.0.0.1:49884,
datanodeUuid=154b2810-2bb9-4ec5-bdb3-c8d38988adae, infoPort=49886,
infoSecurePort=0, ipcPort=49887,
storageInfo=lv=-57;cid=testClusterID;nsid=653559418;c=1468152288554), blocks:
2, hasStaleStorage: false, processing time: 1 msecs, invalidatedBlocks: 0
2016-07-10 20:04:56,587 [DataNode:
[[[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/,
[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data6/]]
heartbeating to localhost/127.0.0.1:49874] INFO datanode.DataNode
(BPServiceActor.java:blockReport(389)) - Successfully sent block report
0x665205cb9591e953, containing 2 storage report(s), of which we sent 2. The
reports had 4 total blocks and used 1 RPC(s). This took 1 msec to generate and
3 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-10 20:04:56,587 [DataNode:
[[[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/,
[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data6/]]
heartbeating to localhost/127.0.0.1:49874] INFO datanode.DataNode
(BPOfferService.java:processCommandFromActive(696)) - Got finalize command for
block pool BP-1517015184-127.0.0.1-1468152288554
2016-07-10 20:04:57,410 [main] WARN hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster
to start...
2016-07-10 20:04:58,417 [main] WARN hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster
to start...
2016-07-10 20:04:59,422 [main] WARN hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster
to start...
2016-07-10 20:05:00,423 [main] WARN hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster
to start...
2016-07-10 20:05:01,425 [main] WARN hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster
to start...
2016-07-10 20:05:02,430 [main] WARN hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster
to start...
2016-07-10 20:05:03,435 [main] ERROR hdfs.MiniDFSCluster
(MiniDFSCluster.java:waitClusterUp(1362)) - Timed out waiting for Mini HDFS
Cluster to start
{code}
And then the cluster start timeout. So I added a line
{{cluster.triggerBlockReports();}} before the operation
{{cluster.restartNameNode(true);}}, it will speed this. I tested again for this
test many times with my change, the result were all passed.
In addition, the failed test
{{TestOpenFilesWithSnapshot#testParentDirWithUCFileDeleteWithSnapShot}} that
showed in
HDFS-10512(https://builds.apache.org/job/PreCommit-HDFS-Build/16005/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
and the similar test
{{TestOpenFilesWithSnapshot#testUCFileDeleteWithSnapShot}} are both have this
problem. I also did a triggerBlockReports and savenamespace operations to speed
the cluster startup.
Post the patch from me.
> Flaky test
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testWithCheckpoint
> -------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10603
> URL: https://issues.apache.org/jira/browse/HDFS-10603
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, namenode
> Reporter: Yongjun Zhang
>
> Test
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testWithCheckpoint
> may fail intermittently as
> {code}
> -------------------------------------------------------
> T E S T S
> -------------------------------------------------------
> Running
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
> Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 63.386 sec
> <<< FAILURE! - in
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
> testWithCheckpoint(org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot)
> Time elapsed: 15.092 sec <<< ERROR!
> java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
> at
> org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1363)
> at
> org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2041)
> at
> org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2011)
> at
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testWithCheckpoint(TestOpenFilesWithSnapshot.java:94)
> Results :
> Tests in error:
> TestOpenFilesWithSnapshot.testWithCheckpoint:94 ยป IO Timed out waiting for
> Min...
> Tests run: 7, Failures: 0, Errors: 1, Skipped: 0
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]