[ https://issues.apache.org/jira/browse/HDFS-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502607#comment-13502607 ]

Inder Singh commented on HDFS-2095:
-----------------------------------

After looking at the code in detail, I am wondering why we need to do 
checkDiskError() for all IOExceptions in PacketResponder, if its job is just 
to send ACKs.

"Connection reset by peer" and interrupted-channel exceptions are both kinds 
of IOException and should not cause checkDiskError().

What do you folks think about it?
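
Here is a minimal sketch of the filtering suggested above. The class and 
method names (DiskErrorFilter, handleAckFailure) are made up for illustration 
and this is not the actual PacketResponder code; the idea is just to classify 
network- and interrupt-related IOExceptions so only unexplained local IO 
errors fall through to a disk check:

import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketException;
import java.nio.channels.ClosedChannelException;

// Illustrative sketch only -- not the actual PacketResponder code.
public final class DiskErrorFilter {

    // Returns true if the exception points at a network or interrupt
    // condition rather than a bad local disk.
    static boolean isNetworkRelated(IOException e) {
        if (e instanceof ClosedChannelException      // also covers ClosedByInterruptException
                || e instanceof InterruptedIOException
                || e instanceof SocketException) {
            return true;
        }
        String msg = e.getMessage();
        return msg != null && msg.contains("Connection reset by peer");
    }

    // How a responder loop might use the filter (hypothetical call site).
    static void handleAckFailure(IOException e) {
        if (isNetworkRelated(e)) {
            // Peer went away or we were interrupted: log and move on,
            // no reason to suspect the local volumes.
            System.err.println("ack failed, skipping disk check: " + e);
        } else {
            // Only genuinely unexplained IO errors trigger a disk check.
            System.err.println("ack failed, checking disks: " + e);
            // dataNode.checkDiskError(e);  // hypothetical call, stubbed out
        }
    }
}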



                
> org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check storm making data node unavailable
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2095
>                 URL: https://issues.apache.org/jira/browse/HDFS-2095
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.21.0
>            Reporter: Vitalii Tymchyshyn
>         Attachments: patch2.diff, patch.diff, pathch3.diff
>
>
> I can see that if a data node receives some IO error, this can cause a checkDir storm.
> What I mean:
> 1) Any error produces a DataNode.checkDiskError call.
> 2) This call locks the volume set:
>  java.lang.Thread.State: RUNNABLE
>        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>        at java.io.File.exists(File.java:733)
>        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
>        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
>        - locked <0x000000080a8faec0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 3) This produces timeouts on other calls, e.g.:
> 2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
> java.io.InterruptedIOException
>        at java.io.FileOutputStream.writeBytes(Native Method)
>        at java.io.FileOutputStream.write(FileOutputStream.java:260)
>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 4) This, in turn, produces more "dir check" calls.
> 5) The whole cluster works very slowly because of the half-working node.
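
The storm in steps 1)-5) comes from every failed write triggering its own full 
directory scan while the scans serialize on the FSVolumeSet lock. Purely as an 
illustration (this is not one of the attached patches, and the names 
ThrottledDiskChecker and maybeCheckDirs are made up), one way to damp it would 
be to collapse bursts of checkDiskError() calls into at most one scan per 
interval:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical damping of the check storm described above: bursts of
// checkDiskError() calls collapse into at most one real scan per interval.
public final class ThrottledDiskChecker {
    private static final long MIN_INTERVAL_MS = 60000L; // assumed interval
    private final AtomicLong lastCheckMs = new AtomicLong(0);

    // Runs the (expensive, lock-holding) scan only if enough time has
    // passed since the previous one; concurrent callers return fast.
    public void maybeCheckDirs(Runnable realCheck) {
        long now = System.currentTimeMillis();
        long last = lastCheckMs.get();
        if (now - last >= MIN_INTERVAL_MS
                && lastCheckMs.compareAndSet(last, now)) {
            realCheck.run(); // e.g. the full directory-tree check
        }
        // Otherwise a scan just ran or is running; skip, so one bad write
        // cannot fan out into a storm of scans that serialize every
        // DataXceiver on the volume-set lock.
    }
}

The compare-and-set on an AtomicLong keeps the guard itself lock-free, so 
threads that lose the race return immediately instead of queueing behind the 
scan.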

