[ 
https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490156
 ] 

eric baldeschwieler commented on HADOOP-1170:
---------------------------------------------

The thing to understand is that we cannot upgrade our cluster to HEAD with 
this patch committed.  This patch breaks us.  We'll try to move forward in the 
new issue rather than advocating rolling this back, but this patch did not 
address the concerns we raised in this bug and so we have a problem.  I hope we 
can avoid this in the future.

I'm not advocating rolling back because I agree that these checks were not the 
appropriate solution to the disk problems they solved.

In case the context isn't clear, we frequently see individual drives go 
read-only on our machines.  This check was inserted to allow this problem to be 
detected early and to avoid failed jobs caused by write failures.
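
To make the failure mode concrete, here is a minimal sketch of a probe that would notice a drive that has gone read-only, by trying to create and delete a small file in the data directory.  It is an illustration only, not the check that was in FSDataset or DiskChecker; the class name WriteProbe and the method isWritable are hypothetical.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper (not part of Hadoop): probes whether a directory accepts writes. */
final class WriteProbe {
    private WriteProbe() {}

    /**
     * Returns true only if a temporary file can be created in dir and then deleted.
     * A drive that has been remounted read-only fails the create step.
     */
    static boolean isWritable(File dir) {
        if (!dir.isDirectory()) {
            return false;
        }
        try {
            // createTempFile picks a unique name, so concurrent probes do not collide.
            File probe = File.createTempFile("probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException e) {
            // Typical for a read-only or failing file system.
            return false;
        }
    }
}
```

A probe like this touches one directory per volume rather than walking every block file, so it is cheap compared with a full directory tree check.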

> Very high CPU usage on data nodes because of FSDataset.checkDataDir() on 
> every connect
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1170
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1170
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.11.2
>            Reporter: Igor Bolotin
>             Fix For: 0.13.0
>
>         Attachments: 1170-v2.patch, 1170.patch
>
>
> While investigating performance issues in our Hadoop DFS/MapReduce cluster I 
> saw very high CPU usage by DataNode processes.
> A stack trace showed the following on most of the data nodes:
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
>         at java.io.UnixFileSystem.checkAccess(Native Method)
>         at java.io.File.canRead(File.java:660)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
>         at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
>         - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>         at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
>         at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
>         at java.lang.Thread.run(Thread.java:595)
> I understand that it would take a while to check the entire data directory, 
> as we have some 180,000 blocks/files in there. But what really bothers me 
> is that, from the code, this check is executed for every client 
> connection to the DataNode, which also means for every task executed in the 
> cluster. Once I commented out the check and restarted the datanodes, 
> performance went up and CPU usage went down to a reasonable level.
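
One way to keep a check of this kind without paying for it on every connection is to rate-limit it.  The sketch below is an assumption for illustration, not the 1170.patch or 1170-v2.patch attached to this issue; the class ThrottledCheck, its method maybeRun, and the interval parameter are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper (not the committed patch): runs a check at most once per interval. */
final class ThrottledCheck {
    private final Runnable check;
    private final long minIntervalMillis;
    // Long.MIN_VALUE marks "never run yet".
    private final AtomicLong lastRun = new AtomicLong(Long.MIN_VALUE);

    ThrottledCheck(Runnable check, long minIntervalMillis) {
        this.check = check;
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Runs the wrapped check if it has never run, or if at least
     * minIntervalMillis have passed since the last run. Returns true if it ran.
     */
    boolean maybeRun(long nowMillis) {
        long last = lastRun.get();
        boolean due = last == Long.MIN_VALUE || nowMillis - last >= minIntervalMillis;
        // compareAndSet lets only one of several concurrent callers claim the run.
        if (due && lastRun.compareAndSet(last, nowMillis)) {
            check.run();
            return true;
        }
        return false;
    }
}
```

A connection handler could call maybeRun(System.currentTimeMillis()) on each accept; most calls would then return immediately instead of walking 180,000 files.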

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
