[ https://issues.apache.org/jira/browse/HDFS-11821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154536#comment-16154536 ]
Ravi Prakash commented on HDFS-11821:
-------------------------------------

My concern with your patch is that remove will now be a bit slower. I think I remember there used to be a time when deletes were holding up the lock for a long time. [~kihwal] Do you have an objection?

I'm also wondering what happens when the information returned by {{countNodes}} is inaccurate (i.e. HDFS hasn't yet realized that the block is corrupt).

Also, the test failures seem related.

> BlockManager.getMissingReplOneBlocksCount() does not report correct value if
> corrupt file with replication factor of 1 gets deleted
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11821
>                 URL: https://issues.apache.org/jira/browse/HDFS-11821
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0, 3.0.0-alpha2
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: HDFS-11821-1.patch, HDFS-11821-2.patch
>
>
> *BlockManager* keeps a separate metric for the number of missing blocks with
> replication factor of 1. This is currently returned by the
> *BlockManager.getMissingReplOneBlocksCount()* method, and it is what is
> displayed in the attribute below for *dfsadmin -report* (in the example
> below, there is one corrupt block that belongs to a file with replication
> factor of 1):
> {noformat}
> ...
> Missing blocks (with replication factor 1): 1
> ...
> {noformat}
> However, if the related file gets deleted (for instance, using the hdfs fsck
> -delete option), this metric never gets updated, and *dfsadmin -report* will
> keep reporting a missing block, even though the file does not exist anymore.
> The only workaround available is to restart the NameNode, so that this metric
> is cleared.
> This can be easily reproduced by forcing a replication factor 1 file
> corruption such as follows:
> 1) Put a file into hdfs with replication factor 1:
> {noformat}
> $ hdfs dfs -Ddfs.replication=1 -put test_corrupt /
> $ hdfs dfs -ls /
> -rw-r--r--   1 hdfs supergroup   19 2017-05-10 09:21 /test_corrupt
> {noformat}
> 2) Find the related block for the file and delete it from the DN:
> {noformat}
> $ hdfs fsck /test_corrupt -files -blocks -locations
> ...
> /test_corrupt 19 bytes, 1 block(s):  OK
> 0. BP-782213640-172.31.113.82-1494420317936:blk_1073742742_1918 len=19 Live_repl=1
> [DatanodeInfoWithStorage[172.31.112.178:20002,DS-a0dc0b30-a323-4087-8c36-26ffdfe44f46,DISK]]
> Status: HEALTHY
> ...
> $ find /dfs/dn/ -name blk_1073742742*
> /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
> /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta
> $ rm -rf /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
> $ rm -rf /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta
> {noformat}
> 3) Running fsck will report the corruption as expected:
> {noformat}
> $ hdfs fsck /test_corrupt -files -blocks -locations
> ...
> /test_corrupt 19 bytes, 1 block(s):
> /test_corrupt: CORRUPT blockpool BP-782213640-172.31.113.82-1494420317936 block blk_1073742742
> MISSING 1 blocks of total size 19 B
> ...
> Total blocks (validated): 1 (avg. block size 19 B)
>   ********************************
>   UNDER MIN REPL'D BLOCKS: 1 (100.0 %)
>   dfs.namenode.replication.min: 1
>   CORRUPT FILES: 1
>   MISSING BLOCKS: 1
>   MISSING SIZE: 19 B
>   CORRUPT BLOCKS: 1
> ...
> {noformat}
> 4) Same for *dfsadmin -report*:
> {noformat}
> $ hdfs dfsadmin -report
> ...
> Under replicated blocks: 1
> Blocks with corrupt replicas: 0
> Missing blocks: 1
> Missing blocks (with replication factor 1): 1
> ...
> {noformat}
> 5) Running the *fsck -delete* option does cause fsck to report correct
> information about the corrupt block, but dfsadmin still shows it:
> {noformat}
> $ hdfs fsck /test_corrupt -delete
> ...
> $ hdfs fsck /
> ...
> The filesystem under path '/' is HEALTHY
> ...
> $ hdfs dfsadmin -report
> ...
> Under replicated blocks: 0
> Blocks with corrupt replicas: 0
> Missing blocks: 0
> Missing blocks (with replication factor 1): 1
> ...
> {noformat}
> The problem seems to be in the *BlockManager.removeBlock()* method, which in
> turn uses the util class *LowRedundancyBlocks*, which classifies blocks
> according to their current replication level, including blocks currently
> marked as corrupt.
> The related metric shown by *dfsadmin -report* for corrupt blocks with
> replication factor 1 is tracked in this *LowRedundancyBlocks*. Whenever a
> block is marked as corrupt and it has a replication factor of 1, the related
> metric is updated. When removing the block, though,
> *BlockManager.removeBlock()* calls *LowRedundancyBlocks.remove(BlockInfo
> block, int priLevel)*, which does not check whether the given block was
> previously marked as corrupt with replication factor 1, which would require
> updating the metric.
> I am shortly proposing a patch that seems to fix this by making
> *BlockManager.removeBlock()* call *LowRedundancyBlocks.remove(BlockInfo
> block, int oldReplicas, int oldReadOnlyReplicas, int outOfServiceReplicas,
> int oldExpectedReplicas)* instead, which does update the metric properly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
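The bookkeeping problem the issue describes can be sketched with a tiny, self-contained Java model. All names below (the class `LowRedundancyModel`, its methods, the priority constant) are illustrative stand-ins, not the real Hadoop classes: the point is only that a remove overload which receives the old replica counts can decrement the missing-repl-1 counter, while a remove-by-priority-level overload cannot, so the metric goes stale.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the metric bookkeeping described in HDFS-11821.
// NOT the actual LowRedundancyBlocks implementation; names are hypothetical.
public class LowRedundancyModel {
    // blockId -> priority level (0 stands in for the corrupt-blocks queue).
    private final Map<Long, Integer> blocks = new HashMap<>();
    // The counter behind getMissingReplOneBlocksCount().
    private long missingReplOneBlocks = 0;

    // Marking a block corrupt with no live replicas and expected
    // replication 1 bumps the metric.
    void addCorrupt(long blockId, int liveReplicas, int expectedReplicas) {
        blocks.put(blockId, 0);
        if (liveReplicas == 0 && expectedReplicas == 1) {
            missingReplOneBlocks++;
        }
    }

    // Buggy path: removal by priority level only. The replica counts are
    // not available here, so the metric is never decremented.
    void removeByPriority(long blockId, int priLevel) {
        blocks.remove(blockId);
    }

    // Fixed path: the overload that receives the old replica counts can
    // tell a missing repl-1 block is going away and adjust the metric.
    void removeWithCounts(long blockId, int oldReplicas, int oldExpectedReplicas) {
        blocks.remove(blockId);
        if (oldReplicas == 0 && oldExpectedReplicas == 1) {
            missingReplOneBlocks--;
        }
    }

    long getMissingReplOneBlocksCount() {
        return missingReplOneBlocks;
    }

    public static void main(String[] args) {
        LowRedundancyModel buggy = new LowRedundancyModel();
        buggy.addCorrupt(1L, 0, 1);
        buggy.removeByPriority(1L, 0);
        // Stale counter: the block is gone but the metric still reports it.
        System.out.println("after remove(block, priLevel): "
            + buggy.getMissingReplOneBlocksCount());

        LowRedundancyModel fixed = new LowRedundancyModel();
        fixed.addCorrupt(1L, 0, 1);
        fixed.removeWithCounts(1L, 0, 1);
        System.out.println("after remove(block, counts):   "
            + fixed.getMissingReplOneBlocksCount());
    }
}
```

This mirrors why the proposed patch switches `BlockManager.removeBlock()` to the counts-taking `remove` overload: only that variant has enough information at removal time to undo the increment made when the block was marked corrupt.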