We should file a Jira issue for this.
I agree that checkDir() is called far too often and is too expensive. I
don't think it serves any important or essential purpose. I vote for
removing it. Has anyone ever seen this check fail, or found the failure
useful for cluster functionality?
Raghu.
Igor Bolotin wrote:
While investigating performance issues in our Hadoop DFS/MapReduce
cluster, I saw very high CPU usage by the DataNode processes.
Stack traces showed the following on most of the data nodes:
"[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
    at java.io.UnixFileSystem.checkAccess(Native Method)
    at java.io.File.canRead(File.java:660)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
    - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
    at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
    at java.lang.Thread.run(Thread.java:595)
I understand that it takes a while to check the entire data
directory, as we have some 180,000 blocks/files in there. But what
really bothers me is that, from what I see in the code, this check is
executed for every client connection to the DataNode, which also means
for every task executed in the cluster. Once I commented out the check
and restarted the datanodes, performance went up and CPU usage dropped
to a reasonable level.
Now the question is: am I missing something here, or should this check
really be removed?
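To make the cost pattern concrete, here is a minimal Java sketch of a recursive directory check that is repeated once per connection. It is not the actual FSDataset or DiskChecker code; the class name DirCheckCost and the methods checkDirTree and checksForConnections are hypothetical names chosen for illustration, and the real check may test more than readability.

```java
import java.io.File;
import java.io.IOException;

public class DirCheckCost {
    /**
     * Recursively verifies that dir and all of its subdirectories are
     * readable directories, and returns how many directories were checked.
     * Illustrative only: this approximates the shape of the recursion seen
     * in the stack trace, not the real Hadoop implementation.
     * Note: symbolic-link cycles are not handled in this sketch.
     */
    public static long checkDirTree(File dir) throws IOException {
        if (!dir.isDirectory() || !dir.canRead()) {
            throw new IOException("Not a readable directory: " + dir);
        }
        long checked = 1;
        File[] children = dir.listFiles();
        if (children != null) {
            for (File child : children) {
                if (child.isDirectory()) {
                    checked += checkDirTree(child);
                }
            }
        }
        return checked;
    }

    /** Simulates running the full tree check once per incoming connection. */
    public static long checksForConnections(File dataDir, int connections)
            throws IOException {
        long total = 0;
        for (int i = 0; i < connections; i++) {
            total += checkDirTree(dataDir);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // The directory to walk is taken from the command line (assumption:
        // defaults to the current directory when no argument is given).
        File dataDir = new File(args.length > 0 ? args[0] : ".");
        long perConnection = checkDirTree(dataDir);
        System.out.println("Directories checked per connection: " + perConnection);
        System.out.println("Directories checked for 100 connections: "
                + checksForConnections(dataDir, 100));
    }
}
```

The total work grows as (number of connections) x (number of directories), and each directory costs at least one file system call, which is consistent with canRead() dominating the stack traces above.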
Best regards,
Igor Bolotin
www.collarity.com