[ 
https://issues.apache.org/jira/browse/HDFS-11576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947498#comment-15947498
 ] 

Inigo Goiri commented on HDFS-11576:
------------------------------------

Thanks [~lukmajercak] for the fix.
The approach of keeping a timeout for each recovery in the {{BlockManager}} 
seems pretty clean to me.
Of course, we still have the problem of what happens if the recovery takes 
longer than the timeout.
Nevertheless, I think a 3-minute timeout should be plenty of time.
Regarding the default value, does anybody have a better idea for what it 
should be?
Right now the timeout is implicitly the heartbeat interval (3 seconds by 
default), and the patch moves it to 3 minutes.

To improve the patch, I would add better logging.
Logging every time we ignore a block recovery seems excessive.
However, I would log every time a recovery exceeds the timeout and we issue a 
new one.
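
To make the suggestion concrete, here is a rough sketch of the behavior I have 
in mind. It is not the code from HDFS-11576.001.patch: the class name 
{{PendingRecoverySketch}}, its methods, the string block IDs, and the use of 
{{java.util.logging}} are all made up for illustration, and the caller passes 
in the current time so the logic is easy to test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical sketch only; names and structure are assumptions. */
class PendingRecoverySketch {
  private static final Logger LOG =
      Logger.getLogger(PendingRecoverySketch.class.getName());

  private final long timeoutMs;
  // Block ID -> time (ms) at which its in-flight recovery times out.
  private final Map<String, Long> expiry = new HashMap<>();

  PendingRecoverySketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Returns true if a recovery command should be issued for the block. */
  synchronized boolean shouldIssueRecovery(String blockId, long nowMs) {
    Long expiresAt = expiry.get(blockId);
    if (expiresAt != null && nowMs < expiresAt) {
      // A recovery is still in flight: ignore quietly, no log line.
      return false;
    }
    if (expiresAt != null) {
      // Only the timeout case is logged, since it should be rare.
      LOG.warning("Recovery of " + blockId + " exceeded " + timeoutMs
          + " ms; issuing a new recovery");
    }
    expiry.put(blockId, nowMs + timeoutMs);
    return true;
  }

  /** Forgets the block once its recovery has completed. */
  synchronized void recoveryFinished(String blockId) {
    expiry.remove(blockId);
  }
}
```

With a 3-minute timeout and 3-second heartbeats, the first heartbeat issues 
the recovery, later heartbeats are ignored without logging, and a single 
warning is logged only if the 3 minutes elapse before the recovery finishes.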

> Block recovery will fail indefinitely if recovery time > heartbeat interval
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-11576
>                 URL: https://issues.apache.org/jira/browse/HDFS-11576
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs, namenode
>    Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2
>            Reporter: Lukas Majercak
>            Assignee: Lukas Majercak
>            Priority: Critical
>         Attachments: HDFS-11576.001.patch, HDFS-11576.repro.patch
>
>
> Block recovery will fail indefinitely if the time to recover a block is 
> always longer than the heartbeat interval. Scenario:
> 1. DN sends heartbeat 
> 2. NN sends a recovery command to DN, recoveryID=X
> 3. DN starts recovery
> 4. DN sends another heartbeat
> 5. NN sends a recovery command to DN, recoveryID=X+1
> 6. After the first recovery succeeds, DN calls commitBlockSynchronization on 
> NN, which fails because X < X+1
> ... 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
