[ 
https://issues.apache.org/jira/browse/HDFS-11821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010913#comment-16010913
 ] 

Wellington Chevreuil commented on HDFS-11821:
---------------------------------------------

I had reviewed the failed tests, but it does not seem related to the code 
changed on this patch. This are not failing on my local builder either. Please 
let me know on any suggestions and/or possible improvements.

> BlockManager.getMissingReplOneBlocksCount() does not report correct value if 
> corrupt file with replication factor of 1 gets deleted
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11821
>                 URL: https://issues.apache.org/jira/browse/HDFS-11821
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0, 3.0.0-alpha2
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: HDFS-11821-1.patch, HDFS-11821-2.patch
>
>
> *BlockManager* keeps a separate metric for number of missing blocks with 
> replication factor of 1. This is returned by 
> *BlockManager.getMissingReplOneBlocksCount()* method currently, and that's 
> what is displayed on below attribute for *dfsadmin -report* (in below 
> example, there's one corrupt block that relates to a file with replication 
> factor of 1):
> {noformat}
> ...
> Missing blocks (with replication factor 1): 1
> ...
> {noformat}
> However, if the related file gets deleted, (for instance, using hdfs fsck 
> -delete option), this metric never gets updated, and *dfsadmin -report* will 
> keep reporting a missing block, even though the file does not exist anymore. 
> The only workaround available is to restart the NN, so that this metric will 
> be cleared.
> This can be easily reproduced by forcing a replication factor 1 file 
> corruption such as follows:
> 1) Put a file into hdfs with replication factor 1:
> {noformat}
> $ hdfs dfs -Ddfs.replication=1 -put test_corrupt /
> $ hdfs dfs -ls /
> -rw-r--r--   1 hdfs     supergroup         19 2017-05-10 09:21 /test_corrupt
> {noformat}
> 2) Find related block for the file and delete it from DN:
> {noformat}
> $ hdfs fsck /test_corrupt -files -blocks -locations
> ...
> /test_corrupt 19 bytes, 1 block(s):  OK
> 0. BP-782213640-172.31.113.82-1494420317936:blk_1073742742_1918 len=19 
> Live_repl=1 
> [DatanodeInfoWithStorage[172.31.112.178:20002,DS-a0dc0b30-a323-4087-8c36-26ffdfe44f46,DISK]]
> Status: HEALTHY
> ...
> $ find /dfs/dn/ -name blk_1073742742*
> /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
> /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta
> $ rm -rf 
> /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
> $ rm -rf 
> /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta
> {noformat}
> 3) Running fsck will report the corruption as expected:
> {noformat}
> $ hdfs fsck /test_corrupt -files -blocks -locations
> ...
> /test_corrupt 19 bytes, 1 block(s): 
> /test_corrupt: CORRUPT blockpool BP-782213640-172.31.113.82-1494420317936 
> block blk_1073742742
>  MISSING 1 blocks of total size 19 B
> ...
> Total blocks (validated):     1 (avg. block size 19 B)
>   ********************************
>   UNDER MIN REPL'D BLOCKS:    1 (100.0 %)
>   dfs.namenode.replication.min:       1
>   CORRUPT FILES:      1
>   MISSING BLOCKS:     1
>   MISSING SIZE:               19 B
>   CORRUPT BLOCKS:     1
> ...
> {noformat}
> 4) Same for *dfsadmin -report*
> {noformat}
> $ hdfs dfsadmin -report
> ...
> Under replicated blocks: 1
> Blocks with corrupt replicas: 0
> Missing blocks: 1
> Missing blocks (with replication factor 1): 1
> ...
> {noformat}
> 5) Running *fsck -delete* option does cause fsck to report correct 
> information about corrupt block, but dfsadmin still shows the corrupt block:
> {noformat}
> $ hdfs fsck /test_corrupt -delete
> ...
> $ hdfs fsck /
> ...
> The filesystem under path '/' is HEALTHY
> ...
> $ hdfs dfsadmin -report
> ...
> Under replicated blocks: 0
> Blocks with corrupt replicas: 0
> Missing blocks: 0
> Missing blocks (with replication factor 1): 1
> ...
> {noformat}
> The problem seems to be on *BlockManager.removeBlock()* method, which in turn 
> uses util class *LowRedundancyBlocks* that classifies blocks according to the 
> current replication level, including blocks currently marked as corrupt. 
> The related metric showed on *dfsadmin -report* for corrupt blocks with 
> replication factor 1 is tracked on this *LowRedundancyBlocks*. Whenever a 
> block is marked as corrupt and it has replication factor of 1, the related 
> metric is updated. When removing the block, though, 
> *BlockManager.removeBlock()* is calling *LowRedundancyBlocks.remove(BlockInfo 
> block, int priLevel)*, which does not check if the given block was previously 
> marked as corrupt and had replication factor 1, which would require for 
> updating the metric.
> Am shortly proposing a patch that seems to fix this by making 
> *BlockManager.removeBlock()*  call *LowRedundancyBlocks.remove(BlockInfo 
> block, int oldReplicas, int oldReadOnlyReplicas, int outOfServiceReplicas, 
> int oldExpectedReplicas)* instead, which does update the metric properly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to