[ 
https://issues.apache.org/jira/browse/HDFS-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199750#comment-17199750
 ] 

zhengchenyu commented on HDFS-15589:
------------------------------------

[~ayushtkn] I know the postpone block's logical. I encounter a case, maybe a 
low probability case. Now we describe this logical simply:

(1) When namenode transient from standby to active, namenode will label all 
DatanodeDescriptor be stale for aviod to delete some possible deleted block.

(2) Then datanode blockreport to namenode, then set DatanodeDescriptor to not 
stale. Then some over replicate block could be delete.

But if (2) happend before (1), the DatanodeDescriptor will keep stale util next 
blockreport, you know blockreport is low frequency rpc operaiton. So 
PostponedMisreplicatedBlocks will keep huge number for long time.

> Huge PostponedMisreplicatedBlocks can't decrease immediately when start 
> namenode after datanode
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15589
>                 URL: https://issues.apache.org/jira/browse/HDFS-15589
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>         Environment: CentOS 7
>            Reporter: zhengchenyu
>            Priority: Major
>
> In our test cluster, I restart my namenode. Then I found many 
> PostponedMisreplicatedBlocks which doesn't decrease immediately. 
> I search the log below like this. 
> {code:java}
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> {code}
> Node: test cluster only have 6 datanode.
> You will see the blockreport called before "Marking all datanodes as stale" 
> which is logged by startActiveServices. But 
> DatanodeStorageInfo.blockContentsStale only set to false in blockreport, then 
> startActiveServices set all datnaode to stale node. So the datanodes will 
> keep stale util next blockreport, then PostponedMisreplicatedBlocks keep a 
> huge number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to