[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697809#comment-17697809
 ] 

ASF GitHub Bot commented on HDFS-16942:
---------------------------------------

sodonnel commented on code in PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#discussion_r1129184134


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java:
##########
@@ -791,6 +792,9 @@ private void offerService() throws Exception {
           shouldServiceRun = false;
           return;
         }
+        if (InvalidBlockReportLeaseException.class.getName().equals(reClass)) {
+          fullBlockReportLeaseId = 0;

Review Comment:
   At line 717, we can see where it attempts to get a lease from the heartbeat 
if the lease in the DN == 0:
   
   ```
      boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
                     scheduler.isBlockReportDue(startTime);
   ```
   
   So it's `isBlockReportDue()` that controls this. Then later, if we have a 
non-zero lease, it will try to create the block report:
   
   ```
           boolean forceFullBr =
               scheduler.forceFullBlockReport.getAndSet(false);
           if (forceFullBr) {
             LOG.info("Forcing a full block report to " + nnAddr);
           }
           if ((fullBlockReportLeaseId != 0) || forceFullBr) {
             cmds = blockReport(fullBlockReportLeaseId);
             fullBlockReportLeaseId = 0;
           }
   ```
   It's really the `isBlockReportDue()` method that controls whether a new one 
should be sent or not, and that is based on the time since the last one. In 
`blockReport()`, it updates the time after a successful block report, but if it 
gets an exception, like this change causes, it will not update the time, and so 
it will try again on the next heartbeat if it gets a new lease.
   
   I think `forceFullBlockReport` is only for tests, or for the command that 
forces a DN block report from the CLI.
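
   To make the retry behaviour concrete, here is a minimal standalone sketch of 
the flow described above. It is not the real `BPServiceActor` code: the class 
`FbrRetrySketch`, the exception `InvalidLeaseException`, the `heartbeat()` 
parameters and the 3 second heartbeat spacing are all hypothetical names and 
values chosen for illustration. It only assumes what is stated above, namely 
that the next report time is updated on success and left alone on failure.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the lease/retry flow; not the real BPServiceActor. */
class FbrRetrySketch {
  /** Stand-in for the NN rejecting an FBR that was sent with a bad lease. */
  static class InvalidLeaseException extends RuntimeException {}

  static final long INTERVAL_MS = 6L * 60 * 60 * 1000; // 6 hour default interval

  long nextBlockReportTime = 0;     // a report is due once now >= this value
  long fullBlockReportLeaseId = 0;  // 0 means "no lease held"
  final List<Long> attempts = new ArrayList<>(); // times at which an FBR was tried

  boolean isBlockReportDue(long now) {
    return now >= nextBlockReportTime;
  }

  /**
   * One heartbeat iteration. nnLease is the lease the NN would hand out if
   * asked; rejectFbr simulates the NN rejecting the report.
   */
  void heartbeat(long now, long nnLease, boolean rejectFbr) {
    // A lease is only requested when none is held and a report is due.
    boolean requestBlockReportLease =
        (fullBlockReportLeaseId == 0) && isBlockReportDue(now);
    if (requestBlockReportLease) {
      fullBlockReportLeaseId = nnLease;
    }
    if (fullBlockReportLeaseId != 0) {
      try {
        attempts.add(now);
        if (rejectFbr) {
          throw new InvalidLeaseException();
        }
        // Only a successful report pushes the next due time forward.
        nextBlockReportTime = now + INTERVAL_MS;
        fullBlockReportLeaseId = 0;
      } catch (InvalidLeaseException e) {
        // Schedule is untouched; dropping the lease lets the next
        // heartbeat request a fresh one and retry.
        fullBlockReportLeaseId = 0;
      }
    }
  }

  public static void main(String[] args) {
    FbrRetrySketch dn = new FbrRetrySketch();
    dn.heartbeat(0L, 1L, true);      // FBR rejected
    dn.heartbeat(3000L, 2L, false);  // still due, so it retries and succeeds
    dn.heartbeat(6000L, 3L, false);  // no longer due, nothing is sent
    System.out.println("FBR attempts at: " + dn.attempts);
  }
}
```

   In this sketch the rejected report at the first heartbeat leaves 
`nextBlockReportTime` unchanged, so the second heartbeat asks for a new lease 
and retries; once that succeeds, the third heartbeat does nothing.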





> Send error to datanode if FBR is rejected due to bad lease
> ----------------------------------------------------------
>
>                 Key: HDFS-16942
>                 URL: https://issues.apache.org/jira/browse/HDFS-16942
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the lease. Then the NN rejects 
> the FBR and logs a message to that effect, but from the Datanode's point of 
> view, it thinks the report was successful and does not try to send another 
> report until the 6 hour default interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated 
> blocks, further adding to the cluster load. Even worse, I have seen the DNs 
> join the cluster with zero blocks, so it is not obvious the under-replication 
> is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
