[
https://issues.apache.org/jira/browse/HDFS-11576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271561#comment-16271561
]
Lukas Majercak commented on HDFS-11576:
---------------------------------------
Hi [~shv], [~chris.douglas],
I've uploaded 012.patch to address some of Konstantin's comments:
# Removed BLOCK_RECOVERY_TIMEOUT_MULTIPLIER from DFSConfigKeys and added it as
a constant to BlockManager
# Log message: the start of the recovery is logged in
internalReleaseLease, and every rejected attempt is logged in
PendingRecoveryBlocks
# PendingRecoveryBlocks.getTime(): this exists so that I can mock it when
testing PendingRecoveryBlocks. I can't see a nicer solution, but I'm happy
to hear suggestions
# For testRecoveryTimeout(), I made callRealMethod final but kept the
name, because "realMethodCalled" suggests the opposite logic
# Overrode SleepAnswer.answer() instead of creating a new protected
SleepAnswer.callRealMethod()
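
To make the first and third points concrete, here is a minimal sketch of a pending-recovery tracker with an overridable clock. It is not the code from 012.patch: the class name PendingRecoverySketch, the multiplier value of 30, the add/remove method names and the use of plain long block IDs are all assumptions made for illustration. Only the idea of a getTime() hook that a test can replace comes from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the tracker discussed above; not the patch code. */
class PendingRecoverySketch {
  /** Assumed value, chosen only for illustration. */
  static final int TIMEOUT_MULTIPLIER = 30;

  private final long timeoutMs;
  // block ID -> time (ms) after which a new recovery attempt is allowed
  private final Map<Long, Long> expiry = new HashMap<>();

  PendingRecoverySketch(long heartbeatIntervalMs) {
    this.timeoutMs = heartbeatIntervalMs * TIMEOUT_MULTIPLIER;
  }

  /** Overridable clock, so a test can control time without sleeping. */
  protected long getTime() {
    return System.nanoTime() / 1_000_000L;
  }

  /** Returns true if a new recovery may start, false if one is still pending. */
  synchronized boolean add(long blockId) {
    long now = getTime();
    Long deadline = expiry.get(blockId);
    if (deadline != null && now < deadline) {
      return false; // rejected: the earlier attempt has not timed out yet
    }
    expiry.put(blockId, now + timeoutMs);
    return true;
  }

  /** Forget a block once its recovery has been committed. */
  synchronized void remove(long blockId) {
    expiry.remove(blockId);
  }
}
```

A test can subclass this sketch, override getTime() to return a value it controls, and then check that a second add() for the same block is rejected until the fake clock passes the timeout.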
> Block recovery will fail indefinitely if recovery time > heartbeat interval
> ---------------------------------------------------------------------------
>
> Key: HDFS-11576
> URL: https://issues.apache.org/jira/browse/HDFS-11576
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs, namenode
> Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Priority: Critical
> Attachments: HDFS-11576.001.patch, HDFS-11576.002.patch,
> HDFS-11576.003.patch, HDFS-11576.004.patch, HDFS-11576.005.patch,
> HDFS-11576.006.patch, HDFS-11576.007.patch, HDFS-11576.008.patch,
> HDFS-11576.009.patch, HDFS-11576.010.patch, HDFS-11576.011.patch,
> HDFS-11576.012.patch, HDFS-11576.repro.patch
>
>
> Block recovery will fail indefinitely if the time to recover a block is
> always longer than the heartbeat interval. Scenario:
> 1. DN sends heartbeat
> 2. NN sends a recovery command to DN, recoveryID=X
> 3. DN starts recovery
> 4. DN sends another heartbeat
> 5. NN sends a recovery command to DN, recoveryID=X+1
> 6. DN calls commitBlockSynchronization on NN after succeeding with the first
> recovery, which fails because X < X+1
> ...
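
The scenario quoted above can be modelled with a toy NameNode that issues a fresh recovery ID on every heartbeat and accepts a commit only for the latest ID. This is a simplified sketch under those assumptions, not HDFS code: the class RecoveryIdRace and the method onHeartbeat are hypothetical, and the acceptance check is reduced to an equality test.

```java
/** Toy model of the NameNode side of the scenario; all names are hypothetical. */
class RecoveryIdRace {
  private long latestRecoveryId = 0;

  /** Each heartbeat response carries a recovery command with a new, larger ID. */
  long onHeartbeat() {
    return ++latestRecoveryId;
  }

  /** Simplified check: only the most recently issued recovery ID is accepted. */
  boolean commitBlockSynchronization(long recoveryId) {
    return recoveryId == latestRecoveryId;
  }
}
```

If onHeartbeat() is called twice before the first recovery finishes, a commit carrying the first ID is rejected. As long as every recovery takes longer than one heartbeat interval, the same thing happens to each later attempt.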