[jira] [Commented] (HDFS-15589) Huge PostponedMisreplicatedBlocks can't decrease immediately when start namenode after datanode

zhengchenyu (Jira) Tue, 22 Sep 2020 00:57:46 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199910#comment-17199910
 ]


zhengchenyu commented on HDFS-15589:
------------------------------------

[~hexiaoqiao]
Yes, in theory, postponedMisreplicatedBlocks only affects the function 
'rescanPostponedMisreplicatedBlocks'. That function takes the namesystem's 
writeLock, so it may degrade namenode RPC performance. But the default value 
of dfs.namenode.blocks.per.postponedblocks.rescan is 10000, so I think the 
performance impact should be small.
But let us look at some logs: some of the rescans took a long time.
{code}
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:15,429 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 65 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:18,496 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 64 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:23,958 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 2459 msecs. 19916 blocks 
are left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:27,023 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 60 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:30,088 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 61 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:33,149 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 58 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:20:47,890 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 5140 msecs. 19916 blocks 
are left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:32:36,458 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 110 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:32:39,529 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 70 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:32:42,596 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 66 msecs. 19916 blocks are 
left. 0 blocks were removed.
hadoop-hdfs-namenode-bd-tz-hadoop-001012.ke.com.log.info.9:2020-09-21 
15:32:45,665 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Rescan of postponedMisreplicatedBlocks completed in 65 msecs. 19916 blocks are 
left. 0 blocks were removed.
{code}
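To pick the slow rescans out of a log like the one above, something like the following could be used. It is only a minimal sketch, not part of Hadoop: the class name RescanLogScan, the method slowRescans and the 1000 msecs threshold are made up for illustration, and it assumes each log record sits on a single line (the excerpt above is wrapped by the mail client).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Hadoop code: scans namenode log lines for the
// BlockManager rescan message and reports the rescans that were slow.
class RescanLogScan {
    // Matches the message text shown in the log excerpt above.
    private static final Pattern RESCAN = Pattern.compile(
        "Rescan of postponedMisreplicatedBlocks completed in (\\d+) msecs\\. "
        + "(\\d+) blocks are left\\. (\\d+) blocks were removed\\.");

    /** Returns the durations (in msecs) of rescans longer than thresholdMs. */
    static List<Long> slowRescans(List<String> lines, long thresholdMs) {
        List<Long> slow = new ArrayList<>();
        for (String line : lines) {
            Matcher m = RESCAN.matcher(line);
            if (m.find()) {
                long ms = Long.parseLong(m.group(1));
                if (ms > thresholdMs) {
                    slow.add(ms);
                }
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Three records taken from the excerpt above, joined onto one line each.
        List<String> lines = Arrays.asList(
            "2020-09-21 15:20:18,496 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: "
                + "Rescan of postponedMisreplicatedBlocks completed in 64 msecs. 19916 blocks are left. 0 blocks were removed.",
            "2020-09-21 15:20:23,958 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: "
                + "Rescan of postponedMisreplicatedBlocks completed in 2459 msecs. 19916 blocks are left. 0 blocks were removed.",
            "2020-09-21 15:20:47,890 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: "
                + "Rescan of postponedMisreplicatedBlocks completed in 5140 msecs. 19916 blocks are left. 0 blocks were removed.");
        System.out.println(slowRescans(lines, 1000)); // prints [2459, 5140]
    }
}
```

In the excerpt above, only the 2459 msecs and 5140 msecs rescans exceed such a threshold; the others finish in 58 to 110 msecs.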
In fact, this was found in our test cluster, which is very small, so the 
performance impact can't be measured there. So why do I pay attention to this 
problem? At my last company, one day postponedMisreplicatedBlocks grew to a 
huge number and namenode RPC performance dropped. Some hours later, 
postponedMisreplicatedBlocks decreased and the namenode recovered. At that 
time I was focused on YARN, so I didn't investigate the namenode log, and the 
real cause was never confirmed.

> Huge PostponedMisreplicatedBlocks can't decrease immediately when start 
> namenode after datanode
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15589
>                 URL: https://issues.apache.org/jira/browse/HDFS-15589
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>         Environment: CentOS 7
>            Reporter: zhengchenyu
>            Priority: Major
>
> In our test cluster, I restarted my namenode. Then I found many 
> PostponedMisreplicatedBlocks, which did not decrease immediately. 
> I searched the log and found the following. 
> {code:java}
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> {code}
> Note: the test cluster has only 6 datanodes.
> You can see that the block reports were processed before "Marking all 
> datanodes as stale", which is logged by startActiveServices. But 
> DatanodeStorageInfo.blockContentsStale is only set to false by a block 
> report, and startActiveServices then marks all datanodes as stale. So the 
> datanodes stay stale until the next block report, and 
> PostponedMisreplicatedBlocks stays at a huge number.
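
The ordering described in the quoted issue can be illustrated with a toy model. This is only a sketch under stated assumptions, not the BlockManager code: the Storage class is a hypothetical stand-in for DatanodeStorageInfo that models nothing but the blockContentsStale flag, and the method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the ordering problem; none of this is real Hadoop code.
class StaleOrderingSketch {
    /** Hypothetical stand-in for DatanodeStorageInfo: only the stale flag. */
    static class Storage {
        boolean blockContentsStale = true;

        // A block report is the only event that clears the flag.
        void receiveBlockReport() { blockContentsStale = false; }

        // What "Marking all datanodes as stale" does to each storage.
        void markStale() { blockContentsStale = true; }
    }

    static List<Storage> newStorages(int datanodes) {
        List<Storage> storages = new ArrayList<>();
        for (int i = 0; i < datanodes; i++) {
            storages.add(new Storage());
        }
        return storages;
    }

    static int countStale(List<Storage> storages) {
        int stale = 0;
        for (Storage s : storages) {
            if (s.blockContentsStale) {
                stale++;
            }
        }
        return stale;
    }

    /** Order seen in the issue: block reports arrive, then everything is marked stale. */
    static int reportsThenMarkStale(int datanodes) {
        List<Storage> storages = newStorages(datanodes);
        for (Storage s : storages) s.receiveBlockReport();
        for (Storage s : storages) s.markStale();
        return countStale(storages);
    }

    /** Opposite order: everything is marked stale, then block reports arrive. */
    static int markStaleThenReports(int datanodes) {
        List<Storage> storages = newStorages(datanodes);
        for (Storage s : storages) s.markStale();
        for (Storage s : storages) s.receiveBlockReport();
        return countStale(storages);
    }

    public static void main(String[] args) {
        System.out.println(reportsThenMarkStale(6)); // prints 6
        System.out.println(markStaleThenReports(6)); // prints 0
    }
}
```

With the first ordering all 6 storages remain stale until another block report arrives, which matches the PostponedMisreplicatedBlocks count staying high after the restart.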



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
