[ https://issues.apache.org/jira/browse/HDFS-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HDFS-9631:
----------------------------------
Description:
I found that a number of {{TestOpenFilesWithSnapshot}} tests fail quite frequently.
These tests ({{testParentDirWithUCFileDeleteWithSnapshot}}, {{testOpenFilesWithRename}}, and {{testWithCheckpoint}}) are unable to reconnect to the NameNode after a restart. It looks like the reconnection fails due to an EOFException between the DataNode and the NameNode.
{noformat}
FAILED:  org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot

Error Message:
Timed out waiting for Mini HDFS Cluster to start

Stack Trace:
java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
	at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1345)
	at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2024)
	at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1985)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot(TestOpenFilesWithSnapshot.java:82)
{noformat}
It appears that these three tests all call {{doWriteAndAbort()}}, which creates files and then aborts their output streams, then takes a snapshot, and then deletes the files' parent directory.
Interestingly, if the parent directory does not have a snapshot, the tests will
not fail.
The following test will fail intermittently:
{code:java}
public void testDeleteParentDirWithSnapShot() throws Exception {
  Path path = new Path("/test");
  fs.mkdirs(path);
  fs.allowSnapshot(path);

  Path file = new Path("/test/test/test2");
  FSDataOutputStream out = fs.create(file);
  // Write 2 MB in 4-byte chunks (two passes of 1 MB each).
  for (int i = 0; i < 2; i++) {
    long count = 0;
    while (count < 1048576) {
      out.writeBytes("hell");
      count += 4;
    }
  }
  // Persist the file length on the NameNode, then abort the stream,
  // leaving the file under construction.
  ((DFSOutputStream) out.getWrappedStream()).hsync(EnumSet
      .of(SyncFlag.UPDATE_LENGTH));
  DFSTestUtil.abortStream((DFSOutputStream) out.getWrappedStream());

  Path file2 = new Path("/test/test/test3");
  FSDataOutputStream out2 = fs.create(file2);
  for (int i = 0; i < 2; i++) {
    long count = 0;
    while (count < 1048576) {
      out2.writeBytes("hell");
      count += 4;
    }
  }
  ((DFSOutputStream) out2.getWrappedStream()).hsync(EnumSet
      .of(SyncFlag.UPDATE_LENGTH));
  DFSTestUtil.abortStream((DFSOutputStream) out2.getWrappedStream());

  // Take a snapshot, then delete the parent directory of the two
  // under-construction files.
  fs.createSnapshot(path, "s1");
  fs.delete(new Path("/test/test"), true);
  cluster.restartNameNode();
}
{code}
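For comparison, below is a minimal sketch of the control case: the same write/hsync/abort sequence, but with no snapshot taken before the delete. This variant is hypothetical (it is not part of the test suite), but per the observation above it should restart cleanly:
{code:java}
// Hypothetical control case, not in the test suite: identical
// write/hsync/abort sequence, but no snapshot before the delete.
public void testDeleteParentDirWithoutSnapshot() throws Exception {
  Path path = new Path("/test");
  fs.mkdirs(path);
  // Note: no fs.allowSnapshot(path) and no fs.createSnapshot() below.
  Path file = new Path("/test/test/test2");
  FSDataOutputStream out = fs.create(file);
  long count = 0;
  while (count < 2 * 1048576) { // same 2 MB payload as above
    out.writeBytes("hell");
    count += 4;
  }
  ((DFSOutputStream) out.getWrappedStream()).hsync(EnumSet
      .of(SyncFlag.UPDATE_LENGTH));
  DFSTestUtil.abortStream((DFSOutputStream) out.getWrappedStream());
  fs.delete(new Path("/test/test"), true);
  // Without the snapshot, this restart does not time out.
  cluster.restartNameNode();
}
{code}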
I am not sure whether this is a test case issue or something to do with snapshots.
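To reproduce, the failing test can be run in isolation from the hadoop-hdfs module and looped a few times, since the failure is intermittent; assuming the usual Surefire single-test invocation:
{noformat}
mvn test -Dtest=TestOpenFilesWithSnapshot#testParentDirWithUCFileDeleteWithSnapShot
{noformat}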
> Restarting namenode after deleting a directory with snapshot will fail
> ----------------------------------------------------------------------
>
> Key: HDFS-9631
> URL: https://issues.apache.org/jira/browse/HDFS-9631
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
>