Improve ddfs_getattr running time
---------------------------------

                 Key: HADOOP-4682
                 URL: https://issues.apache.org/jira/browse/HADOOP-4682
             Project: Hadoop Core
          Issue Type: Improvement
    Affects Versions: 0.20.0
            Reporter: Marc-Olivier Fleury


As explained in issue HADOOP-3797, stat takes a long time to execute.

I got a clearer idea of the time needed when testing a C program that had to 
crawl a directory tree containing tens of directories and 100K files. The 
original version used stat() to distinguish files from directories, and it 
needed about 1 hour to complete. I changed it to use dirent.d_type, which 
provides the same information and is available at no extra cost when using 
readdir. The execution time dropped to 2-3 minutes.
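
For illustration, here is a minimal sketch of the d_type approach described 
above. It is not the original program: the names count_tree and entry_is_dir 
are made up for this example, and it assumes a platform (such as Linux with 
glibc) where struct dirent exposes d_type. Not every file system fills in 
d_type, so the sketch falls back to lstat() only when readdir reports 
DT_UNKNOWN.

```c
#define _DEFAULT_SOURCE   /* exposes d_type and DT_* constants on glibc */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if the entry is a directory, 0 if it is
 * not, and -1 on error. The per-entry lstat() call is made only when the
 * file system did not report a type through readdir. */
static int entry_is_dir(const char *path, const struct dirent *e)
{
    if (e->d_type != DT_UNKNOWN)
        return e->d_type == DT_DIR;   /* no extra system call needed */

    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

/* Hypothetical crawler: walks the tree under root and adds the number of
 * non-directory entries to *files and of directories to *dirs.
 * Symbolic links are counted as non-directories and are not followed.
 * Returns 0 on success, -1 if any part of the tree could not be read. */
int count_tree(const char *root, long *files, long *dirs)
{
    assert(root != NULL && files != NULL && dirs != NULL);

    DIR *d = opendir(root);
    if (d == NULL)
        return -1;

    int rc = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char path[4096];
        int n = snprintf(path, sizeof path, "%s/%s", root, e->d_name);
        if (n < 0 || (size_t)n >= sizeof path) {
            rc = -1;                  /* path too long for the buffer */
            continue;
        }

        int is_dir = entry_is_dir(path, e);
        if (is_dir < 0) {
            rc = -1;
        } else if (is_dir) {
            ++*dirs;
            if (count_tree(path, files, dirs) != 0)
                rc = -1;
        } else {
            ++*files;
        }
    }
    closedir(d);
    return rc;
}
```

A caller would initialise two long counters to zero and pass their addresses 
along with the root path. On a mount where d_type is always populated, the 
crawl makes no stat-family calls at all, which is where the saving comes from.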

I also ran benchmarks using ls with and without color: on the local file 
system I got a speedup of 1.3, while on HDFS the speedup was 5.7. This means 
(very roughly) that calling stat through FUSE is 5.7/1.3 = 4.4 times slower 
than on the local file system.

When using applications that rely on stat to work correctly (there is 
sometimes no other way to distinguish a file from a directory), this can be a 
major source of delay. The application I am working on needs to stat about 
30,000 files; a faster stat() function would save me hours (per task).

I am sure I am not the only one who would appreciate a speedup, so I think 
this issue deserves consideration.

I do not know if the bottleneck is the call to hdfsGetPathInfo or to 
doConnectAsUser, but if it comes from doConnectAsUser, some improvements can 
surely be made.

And in the worst case, caching might help, as suggested in HADOOP-3797.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
