Improve FsShell -du/-dus and FileSystem.getContentSummary efficiency
--------------------------------------------------------------------

                 Key: HADOOP-4339
                 URL: https://issues.apache.org/jira/browse/HADOOP-4339
             Project: Hadoop Core
          Issue Type: Bug
          Components: fs
    Affects Versions: 0.18.1
            Reporter: David Phillips


FsShell.du has two inefficiencies:

* calling getContentSummary twice for each top-level item rather than calling 
it once and saving the result
* calling getContentSummary for files rather than using the size it already has 
in FileStatus

getContentSummary has one:

* calling itself for files rather than using the length it already has in 
FileStatus

Every call to getContentSummary results in a call to getFileStatus, which may 
be expensive (e.g. with NativeS3FileSystem each call incurs both network 
latency and an actual monetary cost).

The simple solution:

* FsShell.du calls getContentSummary once per item and saves the 
ContentSummary
* FsShell.du uses FileStatus.getLen for files
* getContentSummary only calls itself for directories
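
As a rough illustration of the third point, the sketch below compares the 
two recursion strategies on a toy in-memory tree. It is not Hadoop code: 
Entry (standing in for FileStatus), ToyFs, summarizeNaive and 
summarizeDirsOnly are hypothetical names, and only the total length is 
tracked, whereas the real ContentSummary also carries file and directory 
counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: Entry and ToyFs are hypothetical stand-ins, not Hadoop classes. */
class DuSketch {
    /** Stand-in for FileStatus: a path, a length, and a directory flag. */
    static final class Entry {
        final String path;
        final long len;
        final boolean isDir;

        Entry(String path, long len, boolean isDir) {
            this.path = path;
            this.len = len;
            this.isDir = isDir;
        }
    }

    /** In-memory file system that counts getFileStatus calls. */
    static final class ToyFs {
        private final Map<String, Entry> entries = new HashMap<>();
        private final Map<String, List<Entry>> children = new HashMap<>();
        int statusCalls = 0;

        void add(String parent, String path, long len, boolean isDir) {
            Entry e = new Entry(path, len, isDir);
            entries.put(path, e);
            if (isDir) {
                children.put(path, new ArrayList<>());
            }
            if (parent != null) {
                children.get(parent).add(e);
            }
        }

        Entry getFileStatus(String path) {
            statusCalls++; // models the expensive remote lookup
            return entries.get(path);
        }

        List<Entry> listStatus(String dir) {
            return children.get(dir);
        }
    }

    /** Behaviour described in the report: recurse into every child, files included. */
    static long summarizeNaive(ToyFs fs, String path) {
        Entry e = fs.getFileStatus(path);
        if (!e.isDir) {
            return e.len;
        }
        long total = 0;
        for (Entry child : fs.listStatus(path)) {
            total += summarizeNaive(fs, child.path);
        }
        return total;
    }

    /** Proposed behaviour: recurse only for directories, reuse listed lengths for files. */
    static long summarizeDirsOnly(ToyFs fs, String path) {
        Entry e = fs.getFileStatus(path);
        if (!e.isDir) {
            return e.len;
        }
        long total = 0;
        for (Entry child : fs.listStatus(path)) {
            total += child.isDir ? summarizeDirsOnly(fs, child.path) : child.len;
        }
        return total;
    }

    /** Two files and one subdirectory holding a third file. */
    static ToyFs sampleTree() {
        ToyFs fs = new ToyFs();
        fs.add(null, "/data", 0, true);
        fs.add("/data", "/data/a", 10, false);
        fs.add("/data", "/data/b", 20, false);
        fs.add("/data", "/data/sub", 0, true);
        fs.add("/data/sub", "/data/sub/c", 30, false);
        return fs;
    }

    public static void main(String[] args) {
        ToyFs naiveFs = sampleTree();
        long naive = summarizeNaive(naiveFs, "/data");
        System.out.println("naive: " + naive + " bytes, " + naiveFs.statusCalls + " status calls");

        ToyFs dirsOnlyFs = sampleTree();
        long dirsOnly = summarizeDirsOnly(dirsOnlyFs, "/data");
        System.out.println("dirs only: " + dirsOnly + " bytes, " + dirsOnlyFs.statusCalls + " status calls");
    }
}
```

On this toy tree both strategies report the same total, but the naive 
version performs one status lookup per entry (five), while the 
directories-only version performs one per directory (two).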

Another solution, which avoids special-casing in callers, is to add a 
getContentSummary overload that takes a FileStatus.
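
A minimal sketch of how such an overload might look, again on a toy 
in-memory model rather than the real Hadoop classes. Entry, ToyFs, 
contentLength and du are hypothetical names chosen for illustration, and 
the sketch returns a plain length instead of a ContentSummary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: Entry and ToyFs are hypothetical stand-ins, not Hadoop classes. */
class StatusOverloadSketch {
    /** Stand-in for FileStatus: a path, a length, and a directory flag. */
    static final class Entry {
        final String path;
        final long len;
        final boolean isDir;

        Entry(String path, long len, boolean isDir) {
            this.path = path;
            this.len = len;
            this.isDir = isDir;
        }
    }

    /** In-memory file system that counts getFileStatus calls. */
    static final class ToyFs {
        private final Map<String, Entry> entries = new HashMap<>();
        private final Map<String, List<Entry>> children = new HashMap<>();
        int statusCalls = 0;

        void add(String parent, String path, long len, boolean isDir) {
            Entry e = new Entry(path, len, isDir);
            entries.put(path, e);
            if (isDir) {
                children.put(path, new ArrayList<>());
            }
            if (parent != null) {
                children.get(parent).add(e);
            }
        }

        Entry getFileStatus(String path) {
            statusCalls++; // models the expensive remote lookup
            return entries.get(path);
        }

        List<Entry> listStatus(String dir) {
            return children.get(dir);
        }

        /** Path-based entry point: one status lookup, then delegate. */
        long contentLength(String path) {
            return contentLength(getFileStatus(path));
        }

        /** Hypothetical overload: the caller already holds the status. */
        long contentLength(Entry status) {
            if (!status.isDir) {
                return status.len;
            }
            long total = 0;
            for (Entry child : listStatus(status.path)) {
                total += contentLength(child); // no further status lookups
            }
            return total;
        }
    }

    /** du-style listing: one size per top-level item, computed once and saved. */
    static Map<String, Long> du(ToyFs fs, String dir) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (Entry item : fs.listStatus(dir)) {
            sizes.put(item.path, fs.contentLength(item));
        }
        return sizes;
    }

    /** Two files and one subdirectory holding a third file. */
    static ToyFs sampleTree() {
        ToyFs fs = new ToyFs();
        fs.add(null, "/data", 0, true);
        fs.add("/data", "/data/a", 10, false);
        fs.add("/data", "/data/b", 20, false);
        fs.add("/data", "/data/sub", 0, true);
        fs.add("/data/sub", "/data/sub/c", 30, false);
        return fs;
    }

    public static void main(String[] args) {
        ToyFs fs = sampleTree();
        System.out.println(du(fs, "/data") + " after " + fs.statusCalls + " status calls");
    }
}
```

With the overload, the du loop needs no file-versus-directory branch of 
its own: the listing already supplies a status for every item, so the 
sketch performs no extra status lookups at all. Only the path-based entry 
point pays for one.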

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
