[ 
https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531199#comment-15531199
 ] 

Sean Mackrory commented on HDFS-10797:
--------------------------------------

Thanks for pointing that out [~jingzhao]. I added test cases to address some 
inter-directory renames. Of course, some of them are broken and still report 
the wrong usage. I'd really like to come up with a way for the semantics to be 
both consistent and unsurprising to a user. I improved the situation somewhat 
by computing which nodes were deleted (as opposed to renamed) in the context of 
all the diffs for a directory instead of each diff individually. So it's a step 
in the right direction but the real fix would be to have some global context 
when computing usage that ensures each INode in the hierarchy is counted 
exactly once. It looks to me like that's going to require some refactoring, 
since although the counts are cumulative, they can accumulate in multiple 
distinct objects before being combined. We would need to refactor some 
functions so that all counts are added directly to a single object, and that 
same object could prevent nodes from being counted twice: once because they 
were removed from a snapshotted directory, and again because of where they 
reside now.
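
To make the idea concrete, here is a minimal sketch of what such a single 
shared object could look like. It is not taken from any of the attached 
patches: UsageContext, addFileOnce and the demo class are hypothetical names, 
and the sketch assumes every inode can be identified by a stable numeric ID 
that survives a rename.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical accumulator; not an existing HDFS class. */
final class UsageContext {
    // IDs of inodes whose usage has already been added to the totals.
    private final Set<Long> countedInodeIds = new HashSet<>();
    private long fileCount;
    private long totalBytes;

    /**
     * Adds a file's usage unless this inode was already counted.
     * Returns true if the inode was counted by this call.
     */
    boolean addFileOnce(long inodeId, long bytes) {
        if (!countedInodeIds.add(inodeId)) {
            return false; // already counted via another path
        }
        fileCount++;
        totalBytes += bytes;
        return true;
    }

    long getFileCount() { return fileCount; }
    long getTotalBytes() { return totalBytes; }
}

class UsageContextDemo {
    public static void main(String[] args) {
        UsageContext ctx = new UsageContext();
        // Inode 42 was renamed out of a snapshotted directory, so it
        // appears in that directory's snapshot diff as a deletion...
        ctx.addFileOnce(42L, 1024L);
        // ...and again at the location where it resides now.
        ctx.addFileOnce(42L, 1024L);
        // Inode 7 was really deleted and only exists in the snapshot.
        ctx.addFileOnce(7L, 512L);
        // prints "2 files, 1536 bytes"
        System.out.println(ctx.getFileCount() + " files, "
            + ctx.getTotalBytes() + " bytes");
    }
}
```

The point of the sketch is only that every code path adds into the same 
object, so the duplicate check happens in one place instead of relying on 
each caller to know what the others have already counted.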

Thoughts on this approach before I go further?

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, 
> HDFS-10797.003.patch
>
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how 
> much disk space is used by a snapshot by tallying up the files in the 
> snapshot that have since been deleted (that way it won't overlap with regular 
> files whose disk usage is computed separately). However, that is determined 
> from a diff that shows moved (to Trash or otherwise) or renamed files as a 
> deletion and a creation operation that may overlap with the list of blocks. 
> Only the deletion operation is taken into consideration, and this causes 
> those blocks to get represented twice in the disk usage tallying.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
