[ 
https://issues.apache.org/jira/browse/HDFS-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500037#comment-13500037
 ] 

Inder SIngh commented on HDFS-2095:
-----------------------------------

Folks,

We are hitting this in production, with the same kind of effects described 
here. We are running cdh3u3. 
Until the fix makes it into another update, can anyone suggest a 
workaround for this problem?


                
> org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check 
> storm making data node unavailable
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2095
>                 URL: https://issues.apache.org/jira/browse/HDFS-2095
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.21.0
>            Reporter: Vitalii Tymchyshyn
>         Attachments: patch2.diff, patch.diff, pathch3.diff
>
>
> If a data node receives an IO error, this can cause a checkDir storm.
> What I mean:
> 1) Any error triggers a DataNode.checkDiskError call.
> 2) This call locks the volume set:
>  java.lang.Thread.State: RUNNABLE
>        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>        at java.io.File.exists(File.java:733)
>        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
>        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
>        - locked <0x000000080a8faec0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 3) This produces timeouts on other calls, e.g.
> 2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
> java.io.InterruptedIOException
>        at java.io.FileOutputStream.writeBytes(Native Method)
>        at java.io.FileOutputStream.write(FileOutputStream.java:260)
>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 4) This, in turn, produces more "dir check calls".
> 5) The whole cluster becomes very slow because of the half-working node.
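
The feedback loop in steps 1-5 (each error triggers a full directory scan, the scan holds a lock, waiting threads time out, and each timeout triggers another scan) can be broken in principle by coalescing check requests. The following is only a minimal sketch of that idea, not the content of the attached patches and not a Hadoop API: the class ThrottledDiskCheck, its requestCheck method, and the interval value are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Hadoop): coalesces bursts of disk-check requests. */
public class ThrottledDiskCheck {
    private final Runnable fullCheck;       // stands in for the expensive directory-tree scan
    private final long minIntervalNanos;    // minimum gap between two full scans
    private final AtomicBoolean running = new AtomicBoolean(false);
    private volatile long lastFinishedNanos;
    private volatile boolean everRan = false;

    public ThrottledDiskCheck(Runnable fullCheck, long minIntervalMillis) {
        this.fullCheck = fullCheck;
        this.minIntervalNanos = minIntervalMillis * 1_000_000L;
    }

    /** Returns true if this call ran the scan, false if the request was skipped. */
    public boolean requestCheck() {
        // If another thread is already scanning, return at once instead of
        // queueing behind it; this is what prevents the pile-up of waiters.
        if (!running.compareAndSet(false, true)) {
            return false;
        }
        try {
            long now = System.nanoTime();
            if (everRan && now - lastFinishedNanos < minIntervalNanos) {
                return false; // a scan finished recently, so skip this one
            }
            fullCheck.run();
            lastFinishedNanos = System.nanoTime();
            everRan = true;
            return true;
        } finally {
            running.set(false);
        }
    }

    public static void main(String[] args) {
        AtomicInteger scans = new AtomicInteger();
        // Assumed interval of 60 seconds, purely for the demonstration.
        ThrottledDiskCheck check = new ThrottledDiskCheck(scans::incrementAndGet, 60_000);
        boolean first = check.requestCheck();
        boolean second = check.requestCheck();
        System.out.println("first=" + first + " second=" + second + " scans=" + scans.get());
    }
}
```

The trade-off in this sketch is that a genuinely new disk failure may go unnoticed until the interval expires, so the interval would need to be chosen with that in mind.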

