[ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490156 ]
eric baldeschwieler commented on HADOOP-1170:
---------------------------------------------

The thing to understand is that we cannot upgrade our cluster to HEAD with this patch committed. This patch breaks us. We'll try to move forward in the new issue rather than advocating rolling this back, but this patch did not address the concerns we raised in this bug, so we have a problem. I hope we can avoid this in the future.

I'm not advocating rolling back, because I agree that these checks were not the appropriate solution to the disk problems they were meant to solve.

In case the context isn't clear: we frequently see individual drives go read-only on our machines. This check was inserted so that the problem could be detected early, avoiding failed jobs caused by write failures.

> Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1170
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1170
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.11.2
>            Reporter: Igor Bolotin
>             Fix For: 0.13.0
>
>         Attachments: 1170-v2.patch, 1170.patch
>
>
> While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by the DataNode processes. The stack trace showed the following on most of the data nodes:
>
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
>     at java.io.UnixFileSystem.checkAccess(Native Method)
>     at java.io.File.canRead(File.java:660)
>     at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>     at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
>     at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
>     - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>     at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
>     at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
>     at java.lang.Thread.run(Thread.java:595)
>
> I understand that it would take a while to check the entire data directory, as we have some 180,000 blocks/files in there. But what really bothers me is that, from the code, this check is executed for every client connection to the DataNode, which also means for every task executed in the cluster.
> Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.
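
A minimal sketch of the kind of remedy under discussion: take the full recursive scan off the per-connection path and throttle it to a configurable minimum interval. This is not the committed patch for this issue, and all names below (ThrottledDiskChecker, maybeCheck) are hypothetical; it only illustrates the pattern.

    import java.io.File;

    /**
     * Illustrative sketch only, not the actual HADOOP-1170 fix. It shows one
     * way to decouple an expensive recursive disk check from the
     * per-connection path: run the full scan at most once per configurable
     * interval, so most connections pay nothing.
     */
    public class ThrottledDiskChecker {
        private final File[] dataDirs;
        private final long minIntervalMs;  // e.g. 60000L: full scan at most once a minute
        private long lastCheckMs = 0L;

        public ThrottledDiskChecker(File[] dataDirs, long minIntervalMs) {
            this.dataDirs = dataDirs;
            this.minIntervalMs = minIntervalMs;
        }

        /** Cheap in the common case; called where checkDataDir() used to run. */
        public synchronized void maybeCheck() {
            long now = System.currentTimeMillis();
            if (now - lastCheckMs < minIntervalMs) {
                return;  // scanned recently; skip the directory walk entirely
            }
            lastCheckMs = now;
            for (File dir : dataDirs) {
                checkDirTree(dir);  // full recursive walk, now rare
            }
        }

        /** Recursive readability check, analogous in spirit to FSDir.checkDirTree. */
        private static void checkDirTree(File dir) {
            if (!dir.canRead()) {
                throw new IllegalStateException("Data directory not readable: " + dir);
            }
            File[] children = dir.listFiles();
            if (children == null) {
                return;  // not a directory, or the listing failed
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    checkDirTree(child);
                }
            }
        }
    }

A periodic background thread would serve equally well; the essential point is that a walk proportional to the number of blocks (some 180,000 here) no longer runs once per DataXceiver connection, while read-only drives are still detected within the chosen interval.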