While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by the DataNode processes.
A stack trace showed the following on most of the data nodes:

"[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
        at java.io.UnixFileSystem.checkAccess(Native Method)
        at java.io.File.canRead(File.java:660)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
        at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
        at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
        - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
        at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
        at java.lang.Thread.run(Thread.java:595)

I understand that it would take a while to check the entire data directory, as we have some 180,000 blocks/files in there.
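To illustrate the cost, here is a minimal sketch of a recursive directory check of the kind the stack trace suggests. It is not the actual Hadoop code: the class name, the `checks` counter, and the tiny temporary tree are all assumptions made for illustration. It only shows that one check touches every directory in the tree, so repeating it per connection multiplies that work.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Illustrative sketch only; names here are hypothetical, not Hadoop's.
class DirTreeCheckSketch {
    // Hypothetical counter: how many directories have been checked so far.
    static long checks = 0;

    // Verifies that dir and every subdirectory below it is a readable directory.
    static void checkDirTree(File dir) throws IOException {
        checks++;
        if (!dir.isDirectory() || !dir.canRead()) {
            throw new IOException("not a readable directory: " + dir);
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return; // listing failed or dir vanished; nothing more to walk
        }
        for (File child : children) {
            if (child.isDirectory()) {
                checkDirTree(child);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small tree: root, 3 subdirectories, each with 1 nested directory.
        File root = Files.createTempDirectory("dircheck").toFile();
        for (int i = 0; i < 3; i++) {
            new File(new File(root, "subdir" + i), "nested").mkdirs();
        }

        checkDirTree(root);
        System.out.println("directories checked by one call: " + checks); // 7

        // Simulate running the full check once per incoming connection.
        checks = 0;
        int connections = 100;
        for (int i = 0; i < connections; i++) {
            checkDirTree(root);
        }
        System.out.println("directories checked for " + connections
                + " connections: " + checks); // 700
    }
}
```

With a real data directory the per-call count is far larger than 7, and each of those checks ends in a native file-system access call.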
But what really bothers me is that, from the code, this check appears to be executed for every client connection to the DataNode, which also means for every task executed in the cluster. Once I commented out the check and restarted the DataNodes, performance went up and CPU usage went down to a reasonable level.

Now the question is: am I missing something here, or should this check really be removed?

Best regards,
Igor Bolotin
www.collarity.com