[jira] [Commented] (HDFS-11576) Block recovery will fail indefinitely if recovery time > heartbeat interval

Konstantin Shvachko (JIRA) Wed, 20 Sep 2017 19:24:21 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-11576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174133#comment-16174133
 ]


Konstantin Shvachko commented on HDFS-11576:
--------------------------------------------

Thanks for pinging me, Brahma. Sorry for the slow response.
My main concern is the new config parameter. I think we should make the 
{{BLOCK_RECOVERY_TIMEOUT_MULTIPLIER}} a constant, not configurable.
If I understand [~lukmajercak] correctly, it was made configurable only for 
testing. We can address this by introducing a method 
{code}
static long getBlockRecoveryTimeout() {
  return TimeUnit.SECONDS.toMillis(
      heartbeatIntervalSecs * BLOCK_RECOVERY_TIMEOUT_MULTIPLIER);
}
{code}
And either
# Make it visible for testing, or
# Create a test utility mocking this method, so that one could change the 
timeout for tests.

Both ways work for me.
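
To illustrate the first option, here is a minimal sketch, not the actual patch. The class name, the setter, and the multiplier value of 30 are hypothetical and chosen only for illustration; the heartbeat interval is passed as a parameter so the snippet compiles on its own.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical holder class for illustration; not the real HDFS class. */
final class BlockRecoveryTimeoutSketch {
  // Constant instead of a config key. The value 30 is an assumption.
  static final int BLOCK_RECOVERY_TIMEOUT_MULTIPLIER = 30;

  // Test-only override in milliseconds; null means "use the computed value".
  private static Long timeoutOverrideMs = null;

  /** Returns the recovery timeout in milliseconds. */
  static long getBlockRecoveryTimeout(long heartbeatIntervalSecs) {
    if (timeoutOverrideMs != null) {
      return timeoutOverrideMs;
    }
    return TimeUnit.SECONDS.toMillis(
        heartbeatIntervalSecs * BLOCK_RECOVERY_TIMEOUT_MULTIPLIER);
  }

  /** Visible for testing: set a short timeout, or pass null to reset. */
  static void setBlockRecoveryTimeoutForTesting(Long timeoutMs) {
    timeoutOverrideMs = timeoutMs;
  }
}
```

A test could then call {{setBlockRecoveryTimeoutForTesting(500L)}} before triggering recovery and reset it with {{null}} afterwards, without any new configuration key.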

Minor things:
# It would be good to add a log message stating that block recovery has been 
started but is not yet complete, unless I missed such a message; I don't see 
one in {{internalReleaseLease()}}. 
# {{PendingRecoveryBlocks.getTime()}} seems redundant. A static import should 
achieve the same.
# In {{testRecoveryTimeout()}} the member {{callRealMethod}} should be final, 
otherwise you won't be able to backport to branch-2*. I would also rename it 
to {{realMethodCalled}}.
# I don't understand the need for the new protected 
{{SleepAnswer.callRealMethod()}} if you can just override the entire 
{{SleepAnswer.answer()}} in your test.

> Block recovery will fail indefinitely if recovery time > heartbeat interval
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-11576
>                 URL: https://issues.apache.org/jira/browse/HDFS-11576
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs, namenode
>    Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2
>            Reporter: Lukas Majercak
>            Assignee: Lukas Majercak
>            Priority: Critical
>         Attachments: HDFS-11576.001.patch, HDFS-11576.002.patch, 
> HDFS-11576.003.patch, HDFS-11576.004.patch, HDFS-11576.005.patch, 
> HDFS-11576.006.patch, HDFS-11576.007.patch, HDFS-11576.008.patch, 
> HDFS-11576.009.patch, HDFS-11576.010.patch, HDFS-11576.011.patch, 
> HDFS-11576.repro.patch
>
>
> Block recovery will fail indefinitely if the time to recover a block is 
> always longer than the heartbeat interval. Scenario:
> 1. DN sends heartbeat 
> 2. NN sends a recovery command to DN, recoveryID=X
> 3. DN starts recovery
> 4. DN sends another heartbeat
> 5. NN sends a recovery command to DN, recoveryID=X+1
> 6. DN calls commitBlockSynchronization on the NN after succeeding with the 
> first recovery, which fails because X < X+1
> ... 
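
The scenario above can be sketched as a small simulation. This is a simplified model, not NameNode code: the class and the {{onHeartbeat}} method are hypothetical, and it assumes a commit is accepted only when its recovery ID matches the latest one issued. A timeout of 0 reproduces the bug; a positive timeout models the guard discussed in the comment.

```java
/** Hypothetical, simplified model of the recovery ID race. */
final class RecoveryRaceSketch {
  private final long timeoutMs;       // 0 disables the guard
  private long latestRecoveryId = 0;  // last recovery ID handed out
  private long lastIssuedAtMs = -1;   // -1 means no command issued yet

  RecoveryRaceSketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /**
   * Called per heartbeat. Returns a new recovery ID, or -1 when an
   * earlier recovery is still within its timeout window.
   */
  long onHeartbeat(long nowMs) {
    boolean pending =
        lastIssuedAtMs >= 0 && nowMs - lastIssuedAtMs < timeoutMs;
    if (pending) {
      return -1;
    }
    lastIssuedAtMs = nowMs;
    return ++latestRecoveryId;
  }

  /** Accepts the commit only if the recovery ID is the latest issued. */
  boolean commitBlockSynchronization(long recoveryId) {
    return recoveryId == latestRecoveryId;
  }
}
```

Without the guard, a second heartbeat bumps the ID before the first recovery commits, so the commit for X is rejected. With the guard, the second heartbeat issues no new command and the commit for X succeeds.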



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

