[ 
https://issues.apache.org/jira/browse/HDFS-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593673#comment-13593673
 ] 

Rohit Kochar commented on HDFS-2095:
------------------------------------

Even we have hit this issue in production.In our case most of checkDiskError() 
invocations are happening due to network related exceptions in 
BlockReciever$PacketResponder.run() and DataNode$DataTransfer.run().
Since the current code has a common catch clause for all types of exceptions 
hence even for network related IO Exceptions chedkDiskError() in executed which 
lead to check storm thereby slowing the datanode.

One way to fix this could be to check in the catch clause that whether the 
class of exception belongs to "java.net" package and in those cases skip 
checking the disk.

Folks,
Please suggest if you think that above mentioned fix is the right approach to 
go about.
I can than submit the patch for this issue.

                
> org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check 
> storm making data node unavailable
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2095
>                 URL: https://issues.apache.org/jira/browse/HDFS-2095
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.21.0
>            Reporter: Vitalii Tymchyshyn
>            Assignee: Todd Lipcon
>         Attachments: patch2.diff, patch.diff, pathch3.diff
>
>
> I can see that if data node receives some IO error, this can cause checkDir 
> storm.
> What I mean:
> 1) any error produces DataNode.checkDiskError call
> 2) this call locks volume:
>  java.lang.Thread.State: RUNNABLE
>        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>        at java.io.File.exists(File.java:733)
>        at 
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
>        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
>        - locked <0x000000080a8faec0> (a 
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>        at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
>        at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
>        at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
>        at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 3) This produces timeouts on other calls, e.g.
> 2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> checkDiskError: exception:
> java.io.InterruptedIOException
>        at java.io.FileOutputStream.writeBytes(Native Method)
>        at java.io.FileOutputStream.write(FileOutputStream.java:260)
>        at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 4) This, in turn, produces more "dir check calls".
> 5) All the cluster works very slow because of half-working node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to