[ https://issues.apache.org/jira/browse/HDFS-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499133#comment-13499133 ]
Todd Lipcon commented on HDFS-2095:
-----------------------------------
What do other folks think about Vitalii's approach? The code that recursively
scans the entire drive looks fairly ancient, and I'm not sure why we decided to
do that.
Another approach would be to do the scanning without holding the volume lock,
similar to how we now gather block reports in branch-1 (HDFS-2379).
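
To make the second idea concrete, here is a minimal sketch of scanning without
holding the volume lock. It is not the actual FSDataset code: VolumeSetSketch,
its healthCheck predicate, and getVolumes() are hypothetical stand-ins used only
for illustration. The lock is taken briefly to copy the volume list, the slow
disk probing runs with no lock held, and the lock is taken again only to drop
the volumes that failed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for FSVolumeSet; not the real Hadoop class. */
class VolumeSetSketch {
    private final List<File> volumes = new ArrayList<>();
    // Slow disk probe, injected here so the sketch stays self-contained.
    private final Predicate<File> healthCheck;

    VolumeSetSketch(List<File> initial, Predicate<File> healthCheck) {
        this.volumes.addAll(initial);
        this.healthCheck = healthCheck;
    }

    /** Probes every volume and returns the ones that failed and were removed. */
    List<File> checkDirs() {
        List<File> snapshot;
        synchronized (this) {            // short critical section: copy only
            snapshot = new ArrayList<>(volumes);
        }
        List<File> failed = new ArrayList<>();
        for (File v : snapshot) {        // slow I/O runs with no lock held
            if (!healthCheck.test(v)) {
                failed.add(v);
            }
        }
        synchronized (this) {            // short critical section: apply result
            volumes.removeAll(failed);
        }
        return failed;
    }

    /** Returns a copy of the current volume list. */
    synchronized List<File> getVolumes() {
        return new ArrayList<>(volumes);
    }
}
```

The trade-off is that the volume list may change while the scan is running, so
the result is applied with removeAll, which tolerates volumes that were already
removed by someone else.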
> org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check
> storm making data node unavailable
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-2095
> URL: https://issues.apache.org/jira/browse/HDFS-2095
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 0.21.0
> Reporter: Vitalii Tymchyshyn
> Attachments: patch2.diff, patch.diff, pathch3.diff
>
>
> I can see that if a data node receives an IO error, this can cause a checkDir
> storm.
> What I mean:
> 1) Any error triggers a DataNode.checkDiskError call.
> 2) This call locks the volume set while it scans the directory tree:
> java.lang.Thread.State: RUNNABLE
>     at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>     at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>     at java.io.File.exists(File.java:733)
>     at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
>     at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
>     - locked <0x000000080a8faec0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>     at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
>     at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>     at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>     at java.lang.Thread.run(Thread.java:619)
> 3) This produces timeouts on other calls, e.g.
> 2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
> java.io.InterruptedIOException
>     at java.io.FileOutputStream.writeBytes(Native Method)
>     at java.io.FileOutputStream.write(FileOutputStream.java:260)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
>     at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>     at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>     at java.lang.Thread.run(Thread.java:619)
> 4) These timeouts, in turn, trigger more checkDiskError calls.
> 5) The whole cluster runs very slowly because of the half-working node.
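
One way to break the feedback loop in steps 1-5 is to coalesce check requests,
so that an error arriving while a scan is running, or shortly after one has
finished, does not start another full scan. The sketch below is only an
illustration and is not necessarily what the attached patches do:
CoalescingDiskChecker, requestCheck, and minIntervalMillis are hypothetical
names, and the full scan is passed in as a Runnable.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper; not part of Hadoop. */
class CoalescingDiskChecker {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);
    private final long minIntervalMillis;   // assumed minimum gap between scans
    private final Runnable fullCheck;       // the expensive directory scan
    private volatile long lastCheckMillis = -1;   // -1 means "never ran"

    CoalescingDiskChecker(long minIntervalMillis, Runnable fullCheck) {
        this.minIntervalMillis = minIntervalMillis;
        this.fullCheck = fullCheck;
    }

    /**
     * Runs the full check unless one is already running or one completed
     * less than minIntervalMillis ago. Returns true if the check ran.
     */
    boolean requestCheck(long nowMillis) {
        long last = lastCheckMillis;
        if (last >= 0 && nowMillis - last < minIntervalMillis) {
            return false;                   // checked recently: skip
        }
        if (!inProgress.compareAndSet(false, true)) {
            return false;                   // another thread is scanning: skip
        }
        try {
            fullCheck.run();
            lastCheckMillis = nowMillis;
            return true;
        } finally {
            inProgress.set(false);
        }
    }
}
```

A caller would pass the current time, for example
checker.requestCheck(System.currentTimeMillis()), from each error path instead
of starting the scan directly.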