Mingliang Liu created HDFS-11030:
------------------------------------
Summary: TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
Key: HDFS-11030
URL: https://issues.apache.org/jira/browse/HDFS-11030
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode, test
Affects Versions: 2.7.0
Reporter: Mingliang Liu
Assignee: Mingliang Liu
TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies that
blocks and files are replicated correctly.
To fail a volume, the test deletes all the block files and sets the data dir read-only.
{code}
// fail the volume
// delete/make non-writable one of the directories (failed volume)
data_fail = new File(dataDir, "data3");
failedDir = MiniDFSCluster.getFinalizedDir(dataDir,
    cluster.getNamesystem().getBlockPoolId());
if (failedDir.exists() &&
    //!FileUtil.fullyDelete(failedDir)
    !deteteBlocks(failedDir)
    ) {
  throw new IOException("Could not delete hdfs directory '" + failedDir +
      "'");
}
data_fail.setReadOnly();
failedDir.setReadOnly();
{code}
However, there are two bugs here:
- The {{failedDir}} directory for finalized blocks is not calculated correctly:
it should use {{data_fail}} instead of {{dataDir}} as the base directory.
- When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that
there are no subdirectories in the data dir. This assumption is also stated in
the comments:
{quote}
// we use only small number of blocks to avoid creating subdirs in the data dir..
{quote}
This is not true. On my local cluster and in MiniDFSCluster, the blocks are placed
under a two-level {{subdir0/subdir0/}} directory structure regardless of the number
of blocks. Because of these two bugs, the block files are never actually deleted.
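A minimal sketch of the intended setup (not necessarily the final patch), assuming
we keep the delete-based approach: compute the finalized directory from the failed
volume itself, and delete recursively so that blocks under {{subdir0/subdir0/}} are
removed as well.
{code}
// Sketch only: base the finalized dir on the failed volume (data_fail), not on
// dataDir, and delete recursively so blocks under subdir0/subdir0/ are removed.
File dataFail = new File(dataDir, "data3");
File finalizedDir = MiniDFSCluster.getFinalizedDir(dataFail,
    cluster.getNamesystem().getBlockPoolId());
if (finalizedDir.exists() && !FileUtil.fullyDelete(finalizedDir)) {
  throw new IOException("Could not delete hdfs directory '" + finalizedDir + "'");
}
dataFail.setReadOnly();
finalizedDir.setReadOnly();
{code}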
To fail a volume, the test also needs to trigger the DataNode to remove the volume
and send a block report to the NameNode. This is done in the {{triggerFailure()}} method.
{code}
/**
 * go to each block on the 2nd DataNode until it fails...
 * @param path
 * @param size
 * @throws IOException
 */
private void triggerFailure(String path, long size) throws IOException {
  NamenodeProtocols nn = cluster.getNameNodeRpc();
  List<LocatedBlock> locatedBlocks =
      nn.getBlockLocations(path, 0, size).getLocatedBlocks();

  for (LocatedBlock lb : locatedBlocks) {
    DatanodeInfo dinfo = lb.getLocations()[1];
    ExtendedBlock b = lb.getBlock();
    try {
      accessBlock(dinfo, lb);
    } catch (IOException e) {
      System.out.println("Failure triggered, on block: " + b.getBlockId() +
          "; corresponding volume should be removed by now");
      break;
    }
  }
}
{code}
Accessing those blocks will not trigger any failure if the directory is merely
read-only while the block files are all still there. I ran the test multiple times
without ever triggering the failure. To actually trigger it, we either have to write
new block files to the data directories, or the existing blocks must have been
deleted correctly in the first place.
This unit test has been there for years and seldom fails, simply because it has
never triggered a real volume failure.
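For illustration, a minimal sketch of forcing new writes after the volume is failed,
so the DataNode actually hits the read-only volume and removes it. Names such as
{{fs}} and {{blockSize}} are assumed from the existing test setup, and enough blocks
are written so round-robin volume placement reaches the failed volume.
{code}
// Sketch only: create new block files after the volume has been failed; writing
// to the read-only volume is what surfaces the disk error and removes the volume.
Path newFile = new Path("/test_vol_failure/after_failure.dat");
DFSTestUtil.createFile(fs, newFile, blockSize * 4, (short) 2, 1L);
DFSTestUtil.waitReplication(fs, newFile, (short) 2);
{code}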