[ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15438128#comment-15438128 ]
Sean Mackrory commented on HDFS-10797:
--------------------------------------

To reproduce the discrepancy, follow this procedure. I put a 100 MB file into HDFS and snapshot it (hadoop fs -du -s reports 100 MB * replication factor after both operations), then append another 100 MB to it (hadoop fs -du -s reports 200 MB * replication factor at that point). If I then move the file to trash or simply rename it, hadoop fs -du -s starts reporting 300 MB * replication factor in the second column. I believe that at this point it is counting some of the blocks shared between the snapshot and the regular file twice: it treats the move operation the same as a delete, but since the file wasn't actually deleted, those blocks get counted again.

{quote}
dd if=/dev/zero of=100MB.zero bs=10000 count=10000
bin/hadoop fs -mkdir -p /user/sean
bin/hadoop fs -chown sean /user/sean
bin/hadoop fs -put 100MB.zero /user/sean/HDFS-10797
bin/hdfs dfsadmin -allowSnapshot /user/sean
bin/hdfs dfs -createSnapshot /user/sean s1
bin/hadoop fs -appendToFile 100MB.zero /user/sean/HDFS-10797
bin/hadoop fs -du -s /user/sean
bin/hadoop fs -rm /user/sean/HDFS-10797 # or simply rename with mv
bin/hadoop fs -du -s /user/sean
{quote}

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how
> much disk usage is used by a snapshot by tallying up the files in the
> snapshot that have since been deleted (that way it won't overlap with regular
> files whose disk usage is computed separately). However that is determined
> from a diff that shows moved (to Trash or otherwise) or renamed files as a
> deletion and a creation operation that may overlap with the list of blocks.
> Only the deletion operation is taken into consideration, and this causes
> those blocks to get represented twice in the disk usage tallying.
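
As a rough sanity check on the numbers in the comment above, the arithmetic can be sketched in shell. This is only a model and does not talk to HDFS: the replication factor of 3 is an assumption (override it with REPLICATION), raw_mb is a hypothetical helper, and the sketch assumes the over-count equals the file's size at snapshot time, which matches the 300 MB figure reported above but has not been confirmed against the NameNode code.

```shell
#!/bin/sh
# Model of the second column of `hadoop fs -du -s`
# (raw usage = logical size * replication factor). Pure arithmetic.

REPLICATION=${REPLICATION:-3}   # assumption: replication factor of 3
FILE_MB=100                     # nominal size of 100MB.zero from the steps above

# raw_mb LIVE_MB SNAPSHOT_ONLY_MB -> raw usage in MB (hypothetical helper)
raw_mb() {
    echo $(( ($1 + $2) * REPLICATION ))
}

# After put and createSnapshot: one 100 MB file, nothing snapshot-only.
after_put=$(raw_mb "$FILE_MB" 0)

# After appendToFile: the live file is 200 MB; the snapshot shares its blocks.
after_append=$(raw_mb $((2 * FILE_MB)) 0)

# Expected after rename: the file still exists, so usage should not change.
expected_after_rename=$(raw_mb $((2 * FILE_MB)) 0)

# Observed after rename: the rename shows up as a deletion in the snapshot
# diff, so (by assumption) the snapshot-time 100 MB is added on top.
observed_after_rename=$(raw_mb $((2 * FILE_MB)) "$FILE_MB")

overcount=$(( observed_after_rename - expected_after_rename ))

echo "after put/snapshot:    ${after_put} MB"
echo "after append:          ${after_append} MB"
echo "expected after rename: ${expected_after_rename} MB"
echo "observed after rename: ${observed_after_rename} MB"
echo "over-count:            ${overcount} MB"
```

With the assumed replication factor of 3, the expected figure after the rename is 600 MB and the observed one is 900 MB, i.e. the 200 MB and 300 MB logical sizes from the comment multiplied by the replication factor.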