[ 
https://issues.apache.org/jira/browse/HDFS-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295998#comment-14295998
 ] 

Konstantin Shvachko commented on HDFS-7695:
-------------------------------------------

This was partly investigated under HDFS-7611. The simptoms looked similar to 
the bug described there.
Different test cases are failing there on different runs, with the same 
exception
{code}
java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
        at 
org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1200)
        at 
org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1825)
        at 
org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1786)
        at 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot(TestOpenFilesWithSnapshot.java:89)
{code}
The test
- creates a file and starts adding data
- then aborts the stream
- creates a snapshot while file is not closed
- deletes the file without deleting the snapshot and
- restarts NameNode

The behavior I see from the logs (added extanded logging info) that on restart 
NN replays the edits acoording to the steps above. The block are then reported 
by DNs, but they remain having 0 replicas, and therefore NN cannot leave 
SafeMode.
The missing blocks are supposed to be present, because even though the file was 
deleted its snapshot was not. I do not understand why replicas are not added to 
the locations when they are reported.

> Intermittent failures in TestOpenFilesWithSnapshot
> --------------------------------------------------
>
>                 Key: HDFS-7695
>                 URL: https://issues.apache.org/jira/browse/HDFS-7695
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.6.0
>            Reporter: Konstantin Shvachko
>
> This is to investigate intermittent failures of 
> {{TestOpenFilesWithSnapshot}}, which is timing out on the NameNode restart as 
> it is unable to leave SafeMode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to