[jira] [Updated] (HDFS-15589) Huge PostponedMisreplicatedBlocks can't decrease immediately when start namenode after datanode

2020-09-21 Thread zhengchenyu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated HDFS-15589:
---
Description: 
In our test cluster, I restarted the namenode and then found a large number of 
PostponedMisreplicatedBlocks that did not decrease immediately. 

I searched the namenode log and found the following entries. 
{code:java}
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12

{code}
Note: the test cluster has only 6 datanodes.

The log shows that the block reports were processed before "Marking all 
datanodes as stale", which is logged by startActiveServices. However, 
DatanodeStorageInfo.blockContentsStale is set to false only when a block report 
is processed, and startActiveServices then marks every datanode as stale again. 
The datanodes therefore stay stale until their next block report, and 
PostponedMisreplicatedBlocks stays at a huge number until then.
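
The ordering problem described above can be illustrated with a small toy model. 
This is only a sketch, not HDFS code: the class StaleOrderingSketch, the nested 
Storage class and the method staleAfterStartup are hypothetical names, and the 
only thing modeled is a boolean flag that stands in for 
DatanodeStorageInfo.blockContentsStale.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the startup ordering; hypothetical names, not HDFS code. */
class StaleOrderingSketch {

    /** Stand-in for one datanode storage; only the staleness flag is modeled. */
    static final class Storage {
        boolean blockContentsStale = true;

        // Per the description, only a block report clears the flag.
        void receivedBlockReport() {
            blockContentsStale = false;
        }

        // Models the "Marking all datanodes as stale" step of startActiveServices.
        void markStale() {
            blockContentsStale = true;
        }
    }

    /**
     * Returns how many storages are still stale once startup has finished.
     *
     * @param datanodes           number of simulated datanodes
     * @param reportsBeforeActive true if the block reports are processed
     *                            before the stale marking (the case in the log)
     */
    static int staleAfterStartup(int datanodes, boolean reportsBeforeActive) {
        List<Storage> storages = new ArrayList<>();
        for (int i = 0; i < datanodes; i++) {
            storages.add(new Storage());
        }

        if (reportsBeforeActive) {
            // Order seen in the log: reports first, then everything marked stale.
            storages.forEach(Storage::receivedBlockReport);
            storages.forEach(Storage::markStale);
        } else {
            // Opposite order: marked stale first, then cleared by the reports.
            storages.forEach(Storage::markStale);
            storages.forEach(Storage::receivedBlockReport);
        }

        int stale = 0;
        for (Storage s : storages) {
            if (s.blockContentsStale) {
                stale++;
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        System.out.println(staleAfterStartup(6, true));   // prints 6
        System.out.println(staleAfterStartup(6, false));  // prints 0
    }
}
```

With the order from the log, all 6 simulated datanodes remain stale until 
another block report arrives. With the opposite order, none do.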

  was:
In our test cluster, I restarted the namenode and then found a large number of 
PostponedMisreplicatedBlocks that did not decrease immediately. 

I searched the namenode log and found the following entries. 
{code:java}
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 

[jira] [Updated] (HDFS-15589) Huge PostponedMisreplicatedBlocks can't decrease immediately when start namenode after datanode

2020-09-21 Thread zhengchenyu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated HDFS-15589:
---
Description: 
In our test cluster, I restarted the namenode and then found a large number of 
PostponedMisreplicatedBlocks that did not decrease immediately. 

I searched the namenode log and found the following entries. 
{code:java}
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12

{code}
Note: the test cluster has only 6 datanodes.

The log shows that the block reports were processed before "Marking all 
datanodes as stale", which is logged by startActiveServices. However, 
DatanodeStorageInfo.blockContentsStale is set to false only when a block report 
is processed, and startActiveServices then marks every datanode as stale again. 
The datanodes therefore stay stale until their next block report.
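
One way to confirm this ordering in a namenode log is to count the block report 
lines that appear before the first stale-marking line. The sketch below assumes 
the log has already been read into a list of strings and matches only on the 
two substrings quoted in this description. The class LogOrderCheck, its method, 
and the abbreviated sample lines in main are hypothetical and are not real log 
output.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for checking log ordering; not part of HDFS. */
class LogOrderCheck {
    // Substrings taken from the log excerpt and the description.
    static final String REPORT = "NameNode.blockReport";
    static final String STALE = "Marking all datanodes as stale";

    /**
     * Counts block report lines seen before the first stale-marking line.
     * Returns -1 if no stale-marking line is present at all.
     */
    static int reportsBeforeStaleMark(List<String> lines) {
        int reports = 0;
        for (String line : lines) {
            if (line.contains(STALE)) {
                return reports;
            }
            if (line.contains(REPORT)) {
                reports++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up, abbreviated sample lines for illustration only.
        List<String> sample = Arrays.asList(
            "DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: from ...",
            "DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: from ...",
            "... Marking all datanodes as stale");
        System.out.println(reportsBeforeStaleMark(sample)); // prints 2
    }
}
```

A result greater than zero means at least one block report was logged before 
the stale marking, which is the situation described here.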

  was:
In our test cluster, I restarted the namenode and then found a large number of 
PostponedMisreplicatedBlocks that did not decrease immediately. 

I searched the namenode log and found the following entries. 

{code}

2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
 reports.length=12
2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
from DatanodeRegistration(xx.xx.xx.xx:9866, 
datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
infoSecurePort=0, ipcPort=9867,