[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750116#comment-17750116 ]

ASF GitHub Bot commented on HDFS-17093:
---------------------------------------

Tre2878 commented on PR #5855:
URL: https://github.com/apache/hadoop/pull/5855#issuecomment-1661545892

   > > Two cases will reach this logic:
   > > 
   > > 1. The namenode is restarting, is receiving FBRs from all datanodes, and is in safe mode
   > > 2. The namenode has been running for a long time and is in safe mode for some other reason
   > > 
   > >    In the first case, if the datanode has a failed disk, the datanode will send the FBR for the normal disks and the namenode will handle it normally.
   > >    In the second case, blockReportCount == 0 will always be false if no new disks are added to the datanode.
   > >    So I recommend keeping the code as it is and not using blockReportCount == 0.
   > 
   > @Tre2878 If a disk has failed, its state will be set to `FAILED` and it will be removed from `storageMap`. So the `blockReportCount` check will not involve failed storages, and `blockReportCount == 0` is not always false. Thus, I think my plan can work here. What's your opinion?
   
   You're talking about the first case I mentioned, which is fine.
   The problem is the second case: the namenode has not been restarted, but it is in safe mode for some reason. This logic will also be executed then, and since the report is not the first FBR, blockReportCount will always be greater than 0.
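   The disagreement above can be made concrete with a minimal sketch (the class and method names here are illustrative, not Hadoop's): the proposed plan treats a report as initial only when the storage has never completed an FBR, which holds after a restart or disk replacement (case 1) but never holds on a long-running namenode with unchanged disks (case 2).

```java
// Sketch of the proposed "initial report" guard under discussion.
// InitialReportSketch and isInitial are hypothetical names, not Hadoop code.
public class InitialReportSketch {
    // A storage's report counts as initial only if that storage has
    // never had an FBR processed for it (blockReportCount == 0).
    static boolean isInitial(int blockReportCount) {
        return blockReportCount == 0;
    }
}
```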
   




> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17093
>                 URL: https://issues.apache.org/jira/browse/HDFS-17093
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.3.4
>            Reporter: Yanlei Yu
>            Priority: Minor
>              Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after the restart because of incomplete block 
> reporting.
> In the logs of a datanode with incomplete block reporting, I found that the 
> first FBR attempt failed, possibly due to namenode load, and a second FBR 
> attempt was then made:
> {code:java}
> ....
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. 
> But on the namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage's report is identified as not being the first one, i.e. 
> storageInfo.getBlockReportCount() > 0, the namenode removes the datanode's 
> lease, causing the second report to fail because it no longer holds a lease.
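The failure mode described above can be sketched as a minimal model (the class and field names are hypothetical, not the actual BlockManager code): once any storage of a datanode has a non-zero report count, a retried FBR during startup safe mode removes the lease, so every remaining storage report in that retry is rejected.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the discard logic quoted above.
// BrLeaseSketch, StorageSketch, processReport are illustrative names only.
public class BrLeaseSketch {
    static class StorageSketch {
        int blockReportCount = 0; // FBRs processed for this storage
    }

    final Map<String, StorageSketch> storageMap = new HashMap<>();
    final Map<String, Long> leases = new HashMap<>(); // datanode -> FBR lease
    boolean inStartupSafeMode = true;

    // Mirrors the quoted namenode branch: a non-initial storage report
    // arriving during startup safe mode is discarded AND the datanode's
    // lease is removed, so later storage reports in the same FBR also fail.
    boolean processReport(String node, String storageId) {
        StorageSketch s =
            storageMap.computeIfAbsent(storageId, k -> new StorageSketch());
        if (!leases.containsKey(node)) {
            return false; // no lease: report rejected outright
        }
        if (inStartupSafeMode && s.blockReportCount > 0) {
            leases.remove(node); // lease removed on a retried report
            return false;
        }
        s.blockReportCount++;
        return true;
    }
}
```

Under this model, a datanode whose first FBR partially succeeded before the RPC timed out can never complete its retry, which matches the observed behavior.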



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
