I agree that it is too expensive to call checkDir for every I/O operation. But checkDir is useful in the case where a disk suddenly becomes read-only; we have seen this happen before. We should definitely revisit it, though, since a datanode is now able to manage multiple data directories and each data directory maintains multiple levels of subdirectories.
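As a rough sketch of what a cheaper check might look like (hypothetical code only, not the current FSDataset/DiskChecker implementation; the class name and interval are made up): keep the recursive walk, since it is what catches a disk that has turned read-only, but run it at most once per interval instead of on every client connection.

// Hypothetical sketch only - not the current FSDataset/DiskChecker code.
import java.io.File;

class ThrottledDirChecker {
    // Made-up interval; in practice this could be configurable.
    private static final long MIN_CHECK_INTERVAL_MS = 60_000L;
    private long lastCheckTime = 0L;

    // Returns true if the directory tree looks healthy; skips the
    // expensive walk if it was checked recently.
    synchronized boolean check(File dir) {
        long now = System.currentTimeMillis();
        if (now - lastCheckTime < MIN_CHECK_INTERVAL_MS) {
            return true; // assume still healthy, avoid the recursive walk
        }
        lastCheckTime = now;
        return checkTree(dir);
    }

    // Recursively verifies that every directory is readable and writable,
    // which is what catches the "disk suddenly became read-only" case.
    private boolean checkTree(File dir) {
        if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
            return false;
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return false;
        }
        for (File child : children) {
            if (child.isDirectory() && !checkTree(child)) {
                return false;
            }
        }
        return true;
    }
}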
Hairong

-----Original Message-----
From: Raghu Angadi [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 27, 2007 7:58 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect

We should file a Jira on it. I agree checkDir() is called too many times and is too expensive. I don't think it serves any important or essential purpose. I vote for removing it. Has anyone ever seen this check fail, and/or its failure be useful for cluster functionality?

Raghu.

Igor Bolotin wrote:
> While investigating performance issues in our Hadoop DFS/MapReduce
> cluster I saw very high CPU usage by DataNode processes.
>
> The stack trace showed the following on most of the data nodes:
>
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
>   at java.io.UnixFileSystem.checkAccess(Native Method)
>   at java.io.File.canRead(File.java:660)
>   at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>   at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
>   at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
>   - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>   at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
>   at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
>   at java.lang.Thread.run(Thread.java:595)
>
> I understand that it would take a while to check the entire data
> directory - as we have some 180,000 blocks/files in there. But what
> really bothers me is that, from the code, this check is executed for
> every client connection to the DataNode - which also means for every
> task executed in the cluster. Once I commented out the check and
> restarted the datanodes, performance went up and CPU usage went down
> to a reasonable level.
>
> Now the question is: am I missing something here, or should this
> check really be removed?
>
> Best regards,
> Igor Bolotin
> www.collarity.com
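To make the cost Igor describes concrete, here is a rough, hypothetical measurement sketch (not Hadoop code; the data directory path is an assumption). It walks a directory tree and calls File.canRead() on every entry, i.e. the same native checkAccess call visible at the top of the stack trace. With roughly 180,000 blocks, running a walk like this once per client connection is where the CPU time goes.

// Hypothetical sketch to illustrate the per-connection cost - not Hadoop code.
import java.io.File;

public class CheckDirCost {
    // Recursively walks a directory, calling canRead() on every entry
    // (the native UnixFileSystem.checkAccess call seen in the stack trace).
    static long walk(File dir) {
        long checked = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0;
        }
        for (File f : entries) {
            if (!f.canRead()) {
                System.err.println("unreadable: " + f);
            }
            checked++;
            if (f.isDirectory()) {
                checked += walk(f);
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        // The default path is a placeholder; point it at a DataNode data directory.
        File dataDir = new File(args.length > 0 ? args[0] : "/tmp/dfs/data");
        long start = System.nanoTime();
        long n = walk(dataDir);
        long ms = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("checked " + n + " entries in " + ms + " ms");
    }
}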