[
https://issues.apache.org/jira/browse/HDFS-16379?focusedWorklogId=694597&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-694597
]
ASF GitHub Bot logged work on HDFS-16379:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 12/Dec/21 06:02
Start Date: 12/Dec/21 06:02
Worklog Time Spent: 10m
Work Description: tomscut opened a new pull request #3787:
URL: https://github.com/apache/hadoop/pull/3787
JIRA: [HDFS-16379](https://issues.apache.org/jira/browse/HDFS-16379).
Recently we encountered problems related to full block reports (FBR) in our
production environment, which were solved by introducing HDFS-12914 and
HDFS-14314.
However, the following situation can still occur:
1. The DN gets a `fullBlockReportLeaseId` via heartbeat.
2. The DN triggers a blockReport, but an exception occurs (this may be rare,
but it can happen), and the DN then retries multiple times without resetting
`fullBlockReportLeaseId`, because currently `fullBlockReportLeaseId` is reset
only when the blockReport succeeds.
3. After a while, the exception clears, but the `fullBlockReportLeaseId`
has expired. Since the NN does not throw an exception when the lease has
expired, the DN considers the blockReport successful. So the blockReport is
not actually processed this time, and the DN has to wait until the next one.
Therefore, should we consider resetting the `fullBlockReportLeaseId` in the
finally block? The advantage is that a retry never reuses an expired lease.
The downside is that each heartbeat will request a new
`fullBlockReportLeaseId` while the exception persists, but I think this cost
is negligible.
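
To make the proposal concrete, here is a minimal sketch of the reset-in-finally idea. It is not the actual DataNode code: the class `LeaseResetSketch`, the interface `NameNodeStub`, and the methods `onHeartbeat` and `tryBlockReport` are hypothetical names used only for illustration, and the convention that `0` means "no lease held" is an assumption of the sketch.

```java
import java.io.IOException;

// Hypothetical stand-in for the DataNode side of a full block report.
// None of these names are the real Hadoop classes or methods.
class LeaseResetSketch {
    /** Hypothetical NameNode RPC: receives a report sent under a lease. */
    interface NameNodeStub {
        void blockReport(long leaseId) throws IOException;
    }

    // Assumption of this sketch: 0 means "no lease held", so the next
    // heartbeat is free to store a newly granted lease ID.
    long fullBlockReportLeaseId = 0;

    /** Simulates a heartbeat response that carries a lease ID. */
    void onHeartbeat(long leaseIdFromNameNode) {
        if (fullBlockReportLeaseId == 0) {
            fullBlockReportLeaseId = leaseIdFromNameNode;
        }
    }

    /** Sends one report; returns true if the RPC completed without an exception. */
    boolean tryBlockReport(NameNodeStub nn) {
        if (fullBlockReportLeaseId == 0) {
            return false; // no lease yet, wait for the next heartbeat
        }
        try {
            nn.blockReport(fullBlockReportLeaseId);
            return true;
        } catch (IOException e) {
            return false; // the caller retries later
        } finally {
            // Reset on success and on failure, so a retry never reuses
            // a lease that may have expired in the meantime.
            fullBlockReportLeaseId = 0;
        }
    }
}
```

With this shape, a failed report leaves the field at `0`, and the retry goes out under whatever lease the next heartbeat delivers.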
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 694597)
Time Spent: 0.5h (was: 20m)
> Reset fullBlockReportLeaseId after any exceptions
> -------------------------------------------------
>
> Key: HDFS-16379
> URL: https://issues.apache.org/jira/browse/HDFS-16379
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: tomscut
> Assignee: tomscut
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Recently we encountered FBR-related problems in the production environment,
> which were solved by introducing HDFS-12914 and HDFS-14314.
> But there may be situations like this:
> 1 DN got *fullBlockReportLeaseId* via heartbeat.
> 2 DN trigger a blockReport, but some exception occurs (this may be rare, but
> it may exist), and then DN does multiple retries *without resetting*
> *fullBlockReportLeaseId*. Because fullBlockReportLeaseId is reset
> only if it succeeds currently.
> 3 After a while, the exception is cleared, but the fullBlockReportLeaseId has
> expired. *Since NN did not throw an exception after the lease expired, the DN
> considered that the blockReport was successful.* So the blockReport was not
> actually executed this time and needs to wait until the next time.
> Therefore, *should we consider resetting the fullBlockReportLeaseId in the
> finally block*? The advantage of this is that lease expiration can be
> avoided. The downside is that each heartbeat will apply for a new
> fullBlockReportLeaseId during the exception, but I think this cost is
> negligible.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)