[ https://issues.apache.org/jira/browse/HDFS-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872361#action_12872361 ]
Konstantin Shvachko commented on HDFS-1161:
-------------------------------------------
Eli, I ran TestDataNodeVolumeFailure on my machine with and without your patch.
Without your patch it succeeds. With your patch it falls into an infinite loop
waiting for replication and finally times out.
The main difference is that after a disk error
{code}
[junit] org.apache.hadoop.util.DiskChecker$DiskErrorException: directory is not writable: /home/shv/kryptonite/hdfs/build/test/data/dfs/data/data3/current/finalized
[junit] 	at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:96)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:226)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:412)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:615)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1667)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:763)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockFile(FSDataset.java:1543)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:911)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.getMetaFile(FSDataset.java:699)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.metaFileExists(FSDataset.java:799)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:120)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opReadBlock(DataXceiver.java:169)
[junit] 	at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opReadBlock(DataTransferProtocol.java:353)
[junit] 	at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:325)
[junit] 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:113)
[junit] 	at java.lang.Thread.run(Thread.java:619)
{code}
the current code decides the DN should shut down, while your patch decides to keep the DN running.
{code}
current> [junit] 2010-05-27 20:35:21,871 WARN datanode.DataNode (DataNode.java:handleDiskError(771)) - DataNode.handleDiskError: Keep Running: false
patched> [junit] 2010-05-27 19:22:45,636 WARN datanode.DataNode (DataNode.java:handleDiskError(771)) - DataNode.handleDiskError: Keep Running: true
{code}
This triggers different error reports to the NN. Please try it yourself.
> Make DN minimum valid volumes configurable
> ------------------------------------------
>
> Key: HDFS-1161
> URL: https://issues.apache.org/jira/browse/HDFS-1161
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node
> Affects Versions: 0.21.0, 0.22.0
> Reporter: Eli Collins
> Assignee: Eli Collins
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-1161-1.patch, hdfs-1161-2.patch, hdfs-1161-3.patch,
> hdfs-1161-4.patch, hdfs-1161-5.patch
>
>
> The minimum number of non-faulty volumes needed to keep the DN active is
> hard-coded to 1. It would be useful to allow users to configure this value
> so the DN can be taken offline when, e.g., half of its disks fail;
> otherwise the failure isn't reported until the DN is down to its final
> disk and suffering degraded performance.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.