Improve ddfs_getattr running time
---------------------------------
Key: HADOOP-4682
URL: https://issues.apache.org/jira/browse/HADOOP-4682
Project: Hadoop Core
Issue Type: Improvement
Affects Versions: 0.20.0
Reporter: Marc-Olivier Fleury
As explained in issue HADOOP-3797, stat takes a long time to execute.
I got a clearer idea of the time involved when testing a C program that crawls
a directory tree containing tens of directories and 100K files. The original
version used stat() to distinguish files from directories and needed about
1 hour to complete. I changed it to use dirent.d_type, which provides the same
information and is available at no extra cost when using readdir. The execution
time dropped to 2-3 minutes.
I also ran benchmarks using ls with and without color. On the local file
system, running ls without color gave a speedup of 1.3, while on HDFS the
speedup was 5.7. This means (very roughly) that calling stat through FUSE is
5.7/1.3 = 4.4 times slower than on the local file system.
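A more direct way to estimate the per-call cost would be to time stat() itself
on a set of paths. The following is only a minimal sketch, not something used
for the numbers above; the function name avg_stat_seconds is hypothetical. Each
path is visited once, on the assumption that repeated calls on the same path
could be answered from an attribute cache and would understate the cost.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime and CLOCK_MONOTONIC */
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: average wall-clock seconds per stat() call over the
 * given paths, or -1.0 if n is 0, the clock is unavailable, or a call fails. */
double avg_stat_seconds(const char *const *paths, size_t n)
{
    struct timespec t0, t1;
    struct stat st;

    if (n == 0 || clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    for (size_t i = 0; i < n; i++) {
        if (stat(paths[i], &st) != 0)
            return -1.0;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return elapsed / (double)n;
}
```

Running it once on a list of local paths and once on the same number of paths
under the fuse-dfs mount would give two averages whose ratio can be compared
with the rough 4.4 estimate.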
When using applications that rely on stat to work correctly (there is sometimes
no other way to distinguish a file from a directory), this can be a major
source of delay. The application I am working on needs to stat about 30,000
files; a faster stat() would save me hours (per task).
I am sure I am not the only one who would appreciate a speedup, so I think this
issue deserves consideration.
I do not know if the bottleneck is the call to hdfsGetPathInfo or to
doConnectAsUser, but if it comes from doConnectAsUser, some improvements can
surely be made.
In the worst case, caching might help, as suggested in HADOOP-3797.