sodonnel commented on PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1462908624
> @sodonnel also I am curious, was it just a specific log (only one case e.g. the lease was expired) or combination of logs from `checkLease(DatanodeDescriptor dn, long monotonicNowMs, long id)` that you have seen in various issues?
>
> I wonder if `lease expiry` or `invalid lease` are worth having some dedicated metrics in `NameNodeActivity` (maybe not as with this patch, the subsequent attempt by BP actor should anyways have new lease id acquired from the response of heartbeat API before it reattempts sending FBR).

In the examples I saw, it was expired leases that caused the problem. However, the namenode was under significant pressure when it happened.

In one example, it was actually the SBNN which was rejecting the reports. Tailing the edits was taking frequent long locks (over 300 seconds at a time), which was beyond the lease expiry.

In another example, it was the ANN after startup. I am not sure, but I think the system perhaps came out of safemode with many block reports still outstanding, and then between under-replication and IBRs, contention on the NN lock seemed to block the FBRs until the lease expired.
