[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698631#comment-17698631
 ] 

ASF GitHub Bot commented on HDFS-16942:
---------------------------------------

sodonnel commented on PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1462908624

   > @sodonnel also I am curious, was it just a specific log (only one case 
e.g. the lease was expired) or combination of logs from 
`checkLease(DatanodeDescriptor dn, long monotonicNowMs, long id)` that you have 
seen in various issues?
   > 
   > I wonder if `lease expiry` or `invalid lease` are worth having some 
dedicated metrics in `NameNodeActivity` (maybe not as with this patch, the 
subsequent attempt by BP actor should anyways have new lease id acquired from 
the response of heartbeat API before it reattempts sending FBR).
   
   In the examples I saw, it was expired leases that caused the problem. 
However, the namenode was under significant pressure when it happened. In one 
example, it was actually the SBNN which was rejecting the reports. Tailing the 
edits was taking frequent long locks (over 300 seconds at a time), which was 
beyond the lease expiry.
   
   In another example, it was the ANN after startup. I am not sure, but I think 
the system perhaps came out of safemode with many block reports still 
outstanding, and then, between under-replication and IBRs, contention on the NN 
lock seemed to block the FBRs until the lease expired.
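
   To make the failure mode concrete, here is a minimal sketch of the lease 
check and of the retry proposed in HDFS-16942. It is not the real NameNode 
code: the class `FbrLeaseSketch`, its `NameNode` model, the `Result` enum and 
the 300-second expiry are all assumptions made for illustration. A report that 
arrives after the lease has expired gets an explicit rejection, so the DN can 
request a new lease and resend instead of waiting for the next scheduled FBR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of FBR leases; not the actual HDFS classes. */
public class FbrLeaseSketch {
    enum Result { ACCEPTED, REJECTED_BAD_LEASE }

    /** Toy namenode that tracks one outstanding lease per datanode. */
    static final class NameNode {
        private final long leaseExpiryMs;
        // datanode id -> {lease id, time the lease was issued in ms}
        private final Map<String, long[]> leases = new HashMap<>();
        private long nextLeaseId = 1;

        NameNode(long leaseExpiryMs) {
            this.leaseExpiryMs = leaseExpiryMs;
        }

        /** Issues a fresh lease, replacing any earlier one for this datanode. */
        long requestLease(String dn, long nowMs) {
            long id = nextLeaseId++;
            leases.put(dn, new long[] {id, nowMs});
            return id;
        }

        /** Accepts the report only if the lease is known, matches and is unexpired. */
        Result blockReport(String dn, long leaseId, long nowMs) {
            long[] lease = leases.get(dn);
            boolean valid = lease != null
                && lease[0] == leaseId
                && nowMs - lease[1] < leaseExpiryMs;
            if (!valid) {
                // Returning an explicit result (rather than only logging)
                // is what lets the datanode notice the rejection.
                return Result.REJECTED_BAD_LEASE;
            }
            leases.remove(dn); // a lease is consumed by a successful report
            return Result.ACCEPTED;
        }
    }

    public static void main(String[] args) {
        NameNode nn = new NameNode(300_000L); // assumed 300 s lease expiry
        long lease = nn.requestLease("dn1", 0L);

        // The report is delayed past the expiry, e.g. by a long NN lock.
        Result first = nn.blockReport("dn1", lease, 301_000L);
        System.out.println("first attempt: " + first);

        if (first == Result.REJECTED_BAD_LEASE) {
            // With the error propagated, the datanode retries with a new lease.
            long fresh = nn.requestLease("dn1", 301_000L);
            System.out.println("retry: " + nn.blockReport("dn1", fresh, 302_000L));
        }
    }
}
```

   Timestamps are passed in explicitly so that the expiry can be exercised 
without sleeping; the real code uses a monotonic clock, as the `checkLease` 
signature quoted above suggests.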




> Send error to datanode if FBR is rejected due to bad lease
> ----------------------------------------------------------
>
>                 Key: HDFS-16942
>                 URL: https://issues.apache.org/jira/browse/HDFS-16942
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the lease. Then the NN rejects 
> the FBR and logs a message to that effect, but from the Datanode's point of 
> view, it thinks the report was successful and does not try to send another 
> report until the 6 hour default interval has passed.
> If this happens to a few DNs, there can be missing and under replicated 
> blocks, further adding to the cluster load. Even worse, I have seen the DNs 
> join the cluster with zero blocks, so it is not obvious the under replication 
> is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

