[jira] Updated: (HADOOP-4682) Improve dfs_getattr running time

Marc-Olivier Fleury (JIRA) Tue, 18 Nov 2008 14:02:37 -0800

     [ 
https://issues.apache.org/jira/browse/HADOOP-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Marc-Olivier Fleury updated HADOOP-4682:
----------------------------------------

    Summary: Improve dfs_getattr running time  (was: Improve ddfs_getattr 
running time)

> Improve dfs_getattr running time
> --------------------------------
>
>                 Key: HADOOP-4682
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4682
>             Project: Hadoop Core
>          Issue Type: Improvement
>    Affects Versions: 0.20.0
>            Reporter: Marc-Olivier Fleury
>
> As explained in issue HADOOP-3797, stat takes a long time to execute. 
> I got a clearer idea of the time needed when testing a C program that needed 
> to crawl a directory tree containing tens of directories and 100K files. 
> The original version used stat() to distinguish files from 
> folders. It needed about 1h to complete. I changed it to use dirent.d_type, 
> which provides the same information and is available at no extra cost when 
> using readdir. The execution time dropped to 2-3 mins.
> I also benchmarked ls with and without color: on the 
> local file system the speedup was 1.3, while on HDFS it was 
> 5.7. This means (very roughly) that calling stat through FUSE is 5.7/1.3 = 4.4 
> times slower than on the local file system.
> For applications that rely on stat to work correctly (there is 
> sometimes no other way to distinguish a file from a folder), 
> this can be a major source of delay. The application I am working on needs to 
> stat about 30,000 files; a faster stat() function would save me hours (per 
> task).
> I am sure that I am not the only one who would appreciate a speedup, so I 
> think this issue should be taken into consideration.
> I do not know if the bottleneck is the call to hdfsGetPathInfo or to 
> doConnectAsUser, but if it comes from doConnectAsUser, some improvements can 
> surely be made.
> And in the worst case, caching might help, as suggested in HADOOP-3797.
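
For readers who want to try the readdir-based workaround described in the report, here is a minimal sketch in C. It is not the reporter's program: the names entry_is_dir and count_entries are hypothetical, and the sketch assumes a platform whose struct dirent exposes d_type (glibc and the BSDs do). Some file systems return DT_UNKNOWN, so the sketch falls back to lstat() for those entries only.

```c
#define _DEFAULT_SOURCE  /* exposes d_type and the DT_* constants on glibc */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if the entry is a directory, 0 if it is
 * anything else, and -1 if the type could not be determined. */
static int entry_is_dir(const char *parent, const struct dirent *ent)
{
    /* Fast path: the type came back with readdir(), no extra call needed. */
    if (ent->d_type != DT_UNKNOWN)
        return ent->d_type == DT_DIR;

    /* Slow path: the file system did not report a type, so ask lstat(). */
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", parent, ent->d_name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

/* Hypothetical helper: counts the directories and non-directories directly
 * inside `dir` (not recursive). Returns 0 on success, -1 if `dir` cannot
 * be opened. */
int count_entries(const char *dir, long *n_files, long *n_dirs)
{
    assert(dir != NULL && n_files != NULL && n_dirs != NULL);

    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    *n_files = 0;
    *n_dirs = 0;

    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        /* Skip the self and parent entries. */
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        int r = entry_is_dir(dir, ent);
        if (r == 1)
            ++*n_dirs;
        else if (r == 0)
            ++*n_files;
        /* r == -1: type unknown and lstat() failed; entry is not counted. */
    }
    closedir(d);
    return 0;
}
```

One difference from stat() is worth keeping in mind: d_type reports a symbolic link as DT_LNK rather than as the type of its target, which is why the fallback uses lstat() to stay consistent with the fast path.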

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
