Improve FsShell -du/-dus and FileSystem.getContentSummary efficiency
--------------------------------------------------------------------
Key: HADOOP-4339
URL: https://issues.apache.org/jira/browse/HADOOP-4339
Project: Hadoop Core
Issue Type: Bug
Components: fs
Affects Versions: 0.18.1
Reporter: David Phillips
FsShell.du has two inefficiencies:
* calling getContentSummary twice for each top-level item rather than calling
it once and saving the result
* calling getContentSummary for files rather than using the size it already has
in FileStatus
getContentSummary has one:
* calling itself for files rather than using the length it already has in
FileStatus
Every call to getContentSummary results in a call to getFileStatus, which may
be expensive (e.g. NativeS3FileSystem has both network latency and actual
monetary cost).
The simple solution:
* FsShell.du calls getContentSummary once per item and saves the
ContentSummary
* FsShell.du uses FileStatus.getLen for files
* getContentSummary only calls itself for directories
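The difference in lookup counts can be illustrated with a small self-contained sketch. It does not use the real Hadoop classes: StatusSketch and FsSketch are hypothetical stand-ins for FileStatus and FileSystem, the summary is reduced to a total length, and statusCalls simply counts how often the simulated getFileStatus runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Hadoop's FileStatus (path, directory flag, length).
final class StatusSketch {
    final String path;
    final boolean dir;
    final long len;

    StatusSketch(String path, boolean dir, long len) {
        this.path = path;
        this.dir = dir;
        this.len = len;
    }
}

// Hypothetical in-memory file system that counts getFileStatus lookups.
final class FsSketch {
    private final Map<String, StatusSketch> byPath = new HashMap<>();
    private final Map<String, List<StatusSketch>> listing = new HashMap<>();
    int statusCalls = 0;

    void add(String parent, String path, boolean dir, long len) {
        StatusSketch s = new StatusSketch(path, dir, len);
        byPath.put(path, s);
        if (parent != null) {
            listing.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }
    }

    // Models the potentially expensive per-path lookup.
    StatusSketch getFileStatus(String path) {
        statusCalls++;
        return byPath.get(path);
    }

    // A listing already carries each child's status, including its length.
    List<StatusSketch> listStatus(String path) {
        return listing.getOrDefault(path, new ArrayList<>());
    }

    // Behavior described in the report: recurse into every child, files included.
    long naiveSummary(String path) {
        StatusSketch s = getFileStatus(path);
        if (!s.dir) return s.len;
        long total = 0;
        for (StatusSketch child : listStatus(path)) total += naiveSummary(child.path);
        return total;
    }

    // Proposed behavior: recurse only into directories; file lengths come from the listing.
    long summary(String path) {
        StatusSketch s = getFileStatus(path);
        if (!s.dir) return s.len;
        long total = 0;
        for (StatusSketch child : listStatus(path)) {
            total += child.dir ? summary(child.path) : child.len;
        }
        return total;
    }

    // du: one summary call per directory item, saved in the map; files use their listed length.
    Map<String, Long> du(String dir) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (StatusSketch item : listStatus(dir)) {
            sizes.put(item.path, item.dir ? summary(item.path) : item.len);
        }
        return sizes;
    }

    // Small sample tree: /a holds two files and one subdirectory; /top.txt is a top-level file.
    static FsSketch sample() {
        FsSketch fs = new FsSketch();
        fs.add(null, "/", true, 0);
        fs.add("/", "/a", true, 0);
        fs.add("/a", "/a/f1", false, 10);
        fs.add("/a", "/a/f2", false, 20);
        fs.add("/a", "/a/sub", true, 0);
        fs.add("/a/sub", "/a/sub/f3", false, 5);
        fs.add("/", "/top.txt", false, 7);
        return fs;
    }
}
```

On the sample tree, naiveSummary("/a") performs one lookup per entry (five in total), while summary("/a") performs one per directory (two). The gap grows with the number of files.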
Another solution, rather than adding special casing to callers, is to add a
getContentSummary overload that takes a FileStatus, so that a caller already
holding a FileStatus can pass it in directly.
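A minimal sketch of what such an overload might look like, again with hypothetical stand-in types (StatusStub, OverloadFsSketch) rather than the real Hadoop API, and with the summary reduced to a total length. The path-based method does a single lookup and delegates; the status-based method never looks anything up, because listings already supply each child's status.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Hadoop's FileStatus.
final class StatusStub {
    final String path;
    final boolean dir;
    final long len;

    StatusStub(String path, boolean dir, long len) {
        this.path = path;
        this.dir = dir;
        this.len = len;
    }
}

// Hypothetical file system sketch; lookups counts simulated getFileStatus calls.
final class OverloadFsSketch {
    private final Map<String, StatusStub> byPath = new HashMap<>();
    private final Map<String, List<StatusStub>> listing = new HashMap<>();
    int lookups = 0;

    void add(String parent, String path, boolean dir, long len) {
        StatusStub s = new StatusStub(path, dir, len);
        byPath.put(path, s);
        if (parent != null) {
            listing.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }
    }

    StatusStub getFileStatus(String path) {
        lookups++;
        return byPath.get(path);
    }

    List<StatusStub> listStatus(String path) {
        return listing.getOrDefault(path, new ArrayList<>());
    }

    // Existing entry point: a single lookup for the root of the walk, then delegate.
    long getContentSummary(String path) {
        return getContentSummary(getFileStatus(path));
    }

    // Proposed overload (assumed shape): the caller supplies the status, so no lookup is needed.
    long getContentSummary(StatusStub status) {
        if (!status.dir) return status.len;
        long total = 0;
        for (StatusStub child : listStatus(status.path)) {
            total += getContentSummary(child);
        }
        return total;
    }
}
```

With this shape, callers such as du need no file-versus-directory special casing: they pass along whatever status the listing gave them, and the file case is handled in one place.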