DistributedFileSystem#listStatus is very slow when listing a directory with a 
size of 1300
------------------------------------------------------------------------------------------

                 Key: HADOOP-6502
                 URL: https://issues.apache.org/jira/browse/HADOOP-6502
             Project: Hadoop Common
          Issue Type: Bug
          Components: util
    Affects Versions: 0.20.0
            Reporter: Hairong Kuang
            Priority: Critical
             Fix For: 0.20.2, 0.21.0, 0.22.0


Listing a directory with around 1300 children takes hundreds of 
milliseconds. The slowdown is caused by the change made in HADOOP-4187. The 
return value of listStatus is an array of FileStatus. When deserializing each 
element of the array, ReflectionUtils#newInstance(Class<T>, Configuration) is 
called, which calls setConf, which in turn calls setJobConf. setJobConf checks 
whether JobConf is on the class path by calling Configuration#getClassByName. 
Although Configuration#getClassByName tries to optimize the lookup with a 
cached map, JobConf is not on the class path, so it is never added to the 
cache. Every lookup therefore ends up calling Class#forName, which is very 
expensive. Deserializing an array of 1300 entries requires calling 
Class#forName 1300 times!
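
To illustrate the cost pattern described above, here is a minimal sketch of a 
lookup cache that also remembers misses, so a class that is absent from the 
class path is searched for only once. It is not the patch for this issue and 
not Hadoop code: NegativeClassCache, lookup and forNameCalls are hypothetical 
names chosen for the example, and it assumes Java 8 or later.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Hadoop: caches both successful and
// failed class lookups so Class.forName runs at most once per name.
final class NegativeClassCache {
    // Optional.empty() records "known to be missing from the class path".
    private final Map<String, Optional<Class<?>>> cache = new ConcurrentHashMap<>();
    // Counts how often the expensive Class.forName path is actually taken.
    private final AtomicInteger forNameCalls = new AtomicInteger();

    Optional<Class<?>> lookup(String name) {
        return cache.computeIfAbsent(name, n -> {
            forNameCalls.incrementAndGet();
            try {
                return Optional.<Class<?>>of(Class.forName(n));
            } catch (ClassNotFoundException e) {
                // Remember the miss instead of retrying on every call.
                return Optional.empty();
            }
        });
    }

    int forNameCalls() {
        return forNameCalls.get();
    }

    public static void main(String[] args) {
        NegativeClassCache cache = new NegativeClassCache();
        // Simulate deserializing 1300 entries, each checking for JobConf.
        for (int i = 0; i < 1300; i++) {
            cache.lookup("org.apache.hadoop.mapred.JobConf");
        }
        System.out.println("Class.forName calls: " + cache.forNameCalls());
    }
}
```

With a cache that stores only successful lookups, the loop in main would reach 
Class.forName on every iteration whenever JobConf is missing, which is the 
behavior reported here.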

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
