[ 
https://issues.apache.org/jira/browse/HDFS-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-8986:
----------------------------
    Attachment: HDFS-8986.01.patch

I'm attaching a preliminary patch 1 to show the general idea of this change.

I think Chris' suggestion to modify {{-count}} together with {{-du}} makes 
perfect sense, since they are the two shell commands that use 
{{FileSystem#getContentSummary}}. ({{-rm}} also uses it internally, but that 
is outside the scope of this exclude-snapshot discussion.)

The general idea is to add a {{-x}} flag to both shell commands, to exclude 
snapshots from the calculation.

Implementation-wise, since {{getContentSummary}} is a public API on 
{{FileSystem}}, I don't think we can change anything there. Instead, I 
added fields and methods to {{ContentSummary}} (which is 
{{InterfaceStability.Evolving}}, and I think the changes here don't break 
compatibility anyway) to store snapshot-related values in the same object. 
The caller of {{getContentSummary}} can then subtract the snapshot-related 
values from the total to get the desired {{-x}} result.
On the computation side, this is done by adding a snapshot-specific 
{{ContentCounts}} to the {{ContentSummaryComputationContext}} object.
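
To illustrate the subtraction step at the caller, here is a minimal standalone 
sketch. The class, its fields, and its method names are hypothetical and are 
not the ones in the patch or in the Hadoop API; it only models a summary 
object that carries snapshot-only values next to the totals. The numbers in 
{{main}} are made up.

```java
// Hypothetical stand-in for a content summary that also records
// snapshot-only usage. Names are illustrative, not from the patch.
final class SummarySketch {
    private final long length;                 // total logical bytes, snapshots included
    private final long spaceConsumed;          // total raw bytes, snapshots included
    private final long snapshotLength;         // logical bytes present only in snapshots
    private final long snapshotSpaceConsumed;  // raw bytes present only in snapshots

    SummarySketch(long length, long spaceConsumed,
                  long snapshotLength, long snapshotSpaceConsumed) {
        this.length = length;
        this.spaceConsumed = spaceConsumed;
        this.snapshotLength = snapshotLength;
        this.snapshotSpaceConsumed = snapshotSpaceConsumed;
    }

    // Existing behavior: totals are returned unchanged.
    long getLength() { return length; }
    long getSpaceConsumed() { return spaceConsumed; }

    // What a caller would compute when the exclude-snapshots flag is given.
    long lengthExcludingSnapshots() { return length - snapshotLength; }
    long spaceExcludingSnapshots() { return spaceConsumed - snapshotSpaceConsumed; }

    public static void main(String[] args) {
        // Made-up values: 1000 bytes in total, 400 of them only in snapshots.
        SummarySketch s = new SummarySketch(1000L, 3000L, 400L, 1200L);
        System.out.println("with snapshots:    " + s.getLength() + "  " + s.getSpaceConsumed());
        System.out.println("without snapshots: " + s.lengthExcludingSnapshots()
                + "  " + s.spaceExcludingSnapshots());
    }
}
```

Keeping the totals untouched and exposing the snapshot-only part separately 
means existing callers see no change, and only callers that opt in do the 
subtraction.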

Please review and provide feedback regarding this approach. I plan to polish 
the tests/docs in a later rev. Thanks very much!

> Add option to -du to calculate directory space usage excluding snapshots
> ------------------------------------------------------------------------
>
>                 Key: HDFS-8986
>                 URL: https://issues.apache.org/jira/browse/HDFS-8986
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: snapshots
>            Reporter: Gautam Gopalakrishnan
>            Assignee: Xiao Chen
>         Attachments: HDFS-8986.01.patch
>
>
> When running {{hadoop fs -du}} on a snapshotted directory (or one of its 
> children), the report includes space consumed by blocks that are only present 
> in the snapshots. This is confusing for end users.
> {noformat}
> $  hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 799.7 M  2.3 G  /tmp/parent
> 799.7 M  2.3 G  /tmp/parent/sub1
> $ hdfs dfs -createSnapshot /tmp/parent snap1
> Created snapshot /tmp/parent/.snapshot/snap1
> $ hadoop fs -rm -skipTrash /tmp/parent/sub1/*
> ...
> $ hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 799.7 M  2.3 G  /tmp/parent
> 799.7 M  2.3 G  /tmp/parent/sub1
> $ hdfs dfs -deleteSnapshot /tmp/parent snap1
> $ hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 0  0  /tmp/parent
> 0  0  /tmp/parent/sub1
> {noformat}
> It would be helpful if we had a flag, say -X, to exclude any snapshot-related 
> disk usage from the output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
