[ 
https://issues.apache.org/jira/browse/HDFS-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369617#comment-15369617
 ] 

Yiqun Lin commented on HDFS-10603:
----------------------------------

Thanks [~yzhangal] for reporting this. I tested this unit test many times in my 
local env. Every time the test failed, I found it always costed some time in 
processReport, the output  log like these:
{code}
2016-07-10 20:04:56,585 [Block report processor] INFO  
blockmanagement.BlockManager (BlockManager.java:processReport(2178)) - 
Processing first storage report for DS-7cd4704a-096b-4f6d-aa4f-6f7f74798e53 
from datanode 154b2810-2bb9-4ec5-bdb3-c8d38988adae
2016-07-10 20:04:56,586 [Block report processor] INFO  BlockStateChange 
(BlockManager.java:processReport(2229)) - BLOCK* processReport: from storage 
DS-7cd4704a-096b-4f6d-aa4f-6f7f74798e53 node 
DatanodeRegistration(127.0.0.1:49884, 
datanodeUuid=154b2810-2bb9-4ec5-bdb3-c8d38988adae, infoPort=49886, 
infoSecurePort=0, ipcPort=49887, 
storageInfo=lv=-57;cid=testClusterID;nsid=653559418;c=1468152288554), blocks: 
2, hasStaleStorage: true, processing time: 1 msecs, invalidatedBlocks: 0
2016-07-10 20:04:56,586 [Block report processor] INFO  
blockmanagement.BlockManager (BlockManager.java:processReport(2178)) - 
Processing first storage report for DS-3917a777-191f-448b-be88-2396c5308ab6 
from datanode 154b2810-2bb9-4ec5-bdb3-c8d38988adae
2016-07-10 20:04:56,586 [Block report processor] INFO  BlockStateChange 
(BlockManager.java:processReport(2229)) - BLOCK* processReport: from storage 
DS-3917a777-191f-448b-be88-2396c5308ab6 node 
DatanodeRegistration(127.0.0.1:49884, 
datanodeUuid=154b2810-2bb9-4ec5-bdb3-c8d38988adae, infoPort=49886, 
infoSecurePort=0, ipcPort=49887, 
storageInfo=lv=-57;cid=testClusterID;nsid=653559418;c=1468152288554), blocks: 
2, hasStaleStorage: false, processing time: 1 msecs, invalidatedBlocks: 0
2016-07-10 20:04:56,587 [DataNode: 
[[[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/,
 
[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data6/]]
  heartbeating to localhost/127.0.0.1:49874] INFO  datanode.DataNode 
(BPServiceActor.java:blockReport(389)) - Successfully sent block report 
0x665205cb9591e953,  containing 2 storage report(s), of which we sent 2. The 
reports had 4 total blocks and used 1 RPC(s). This took 1 msec to generate and 
3 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-10 20:04:56,587 [DataNode: 
[[[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/,
 
[DISK]file:/Users/lyq/Documents/work-project/hadoop-trunk/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data6/]]
  heartbeating to localhost/127.0.0.1:49874] INFO  datanode.DataNode 
(BPOfferService.java:processCommandFromActive(696)) - Got finalize command for 
block pool BP-1517015184-127.0.0.1-1468152288554
2016-07-10 20:04:57,410 [main] WARN  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster 
to start...
2016-07-10 20:04:58,417 [main] WARN  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster 
to start...
2016-07-10 20:04:59,422 [main] WARN  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster 
to start...
2016-07-10 20:05:00,423 [main] WARN  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster 
to start...
2016-07-10 20:05:01,425 [main] WARN  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster 
to start...
2016-07-10 20:05:02,430 [main] WARN  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1356)) - Waiting for the Mini HDFS Cluster 
to start...
2016-07-10 20:05:03,435 [main] ERROR hdfs.MiniDFSCluster 
(MiniDFSCluster.java:waitClusterUp(1362)) - Timed out waiting for Mini HDFS 
Cluster to start
{code}
And then the cluster start timeout. So I added a line 
{{cluster.triggerBlockReports();}} before the operation 
{{cluster.restartNameNode(true);}}, it will speed this. I tested again for this 
test many times with my change, the result were all passed.

In addition, the failed test 
{{TestOpenFilesWithSnapshot#testParentDirWithUCFileDeleteWithSnapShot}} that 
showed in 
HDFS-10512(https://builds.apache.org/job/PreCommit-HDFS-Build/16005/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 and the similar test 
{{TestOpenFilesWithSnapshot#testUCFileDeleteWithSnapShot}} are both have this 
problem. I also did a triggerBlockReports and savenamespace operations to speed 
the cluster startup.

Post the patch from me.

> Flaky test 
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testWithCheckpoint
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10603
>                 URL: https://issues.apache.org/jira/browse/HDFS-10603
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>            Reporter: Yongjun Zhang
>
> Test 
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testWithCheckpoint
> may fail intermittently as
> {code}
> -------------------------------------------------------
>  T E S T S
> -------------------------------------------------------
> Running 
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
> Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 63.386 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
> testWithCheckpoint(org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot)
>   Time elapsed: 15.092 sec  <<< ERROR!
> java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1363)
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2041)
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2011)
>       at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testWithCheckpoint(TestOpenFilesWithSnapshot.java:94)
> Results :
> Tests in error: 
>   TestOpenFilesWithSnapshot.testWithCheckpoint:94 ยป IO Timed out waiting for 
> Min...
> Tests run: 7, Failures: 0, Errors: 1, Skipped: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to