[ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484961 ]
Hairong Kuang commented on HADOOP-1170:
---------------------------------------

I agree that it is too costly to call checkDirs on every I/O operation. A background thread that periodically does the sanity check would be a better approach. The patch should also clean up the error-handling code.

> Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1170
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1170
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.11.2
>            Reporter: Igor Bolotin
>         Attachments: 1170.patch
>
>
> While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by the DataNode processes.
> The stack trace showed the following on most of the data nodes:
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
>         at java.io.UnixFileSystem.checkAccess(Native Method)
>         at java.io.File.canRead(File.java:660)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
>         at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
>         - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>         at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
>         at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
>         at java.lang.Thread.run(Thread.java:595)
> I understand that it would take a while to check the entire data directory - we have some 180,000 blocks/files in there. But what really bothers me is that, from the code, I can see this check is executed for every client connection to the DataNode - which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.
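For illustration only, here is a minimal sketch of the background-thread idea suggested above. All names here are hypothetical and this is not the code from the attached 1170.patch: one daemon thread runs the expensive directory walk on a fixed interval, and the connection-serving path only reads a cached flag.

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: a daemon thread performs the expensive directory walk
// periodically; request handlers read a cached flag instead of
// re-walking ~180,000 files on every connection.
public class PeriodicDirChecker implements Runnable {

  // Stand-in for the FSDataset.checkDataDir() call (hypothetical interface).
  public interface DirCheck {
    void checkDirs() throws Exception;
  }

  private final DirCheck target;
  private final long intervalMillis;
  private final AtomicBoolean healthy = new AtomicBoolean(true);

  public PeriodicDirChecker(DirCheck target, long intervalMillis) {
    this.target = target;
    this.intervalMillis = intervalMillis;
  }

  // Cheap query for the request-serving path (e.g. DataXceiveServer).
  public boolean isHealthy() {
    return healthy.get();
  }

  public void run() {
    while (true) {
      try {
        target.checkDirs();        // the expensive recursive walk
        healthy.set(true);
      } catch (Exception e) {
        healthy.set(false);        // record the failure; the cleanup /
                                   // error handling would go here
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException ie) {
        return;                    // stop when the datanode shuts down
      }
    }
  }

  public static Thread startChecker(DirCheck target, long intervalMillis) {
    Thread t = new Thread(new PeriodicDirChecker(target, intervalMillis),
                          "periodic directory checker");
    t.setDaemon(true);
    t.start();
    return t;
  }
}

With something along these lines, DataXceiveServer.run() could consult isHealthy() (or drop the per-connection check entirely), so the per-connection cost stays constant regardless of the number of blocks.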