[ 
https://issues.apache.org/jira/browse/HDFS-11576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941101#comment-15941101
 ] 

Lukas Majercak commented on HDFS-11576:
---------------------------------------

The scenario, as seen in the logs:

*RECOVERY 1003 INITIATED ON THE DATANODE*
{code}
2017-03-24 13:35:02,572 INFO  datanode.DataNode 
(BlockRecoveryWorker.java:logRecoverBlock(486)) - NameNode at /127.0.0.1:32353 
calls 
recoverBlock(BP-796319647-10.123.116.86-1490387692090:blk_1073741825_1001, 
targets=[DatanodeInfoWithStorage[/default-rack/127.0.0.1:32515,null,null], 
DatanodeInfoWithStorage[/default-rack/127.0.0.1:32362,null,null], 
DatanodeInfoWithStorage[/default-rack/127.0.0.1:32439,null,null]], 
newGenerationStamp=1003, newBlock=null, isStriped=false)
{code}

*RECOVERY 1006 INITIATED ON THE DATANODE*
{code}
2017-03-24 13:35:08,564 INFO  datanode.DataNode 
(BlockRecoveryWorker.java:logRecoverBlock(486)) - NameNode at /127.0.0.1:32353 
calls 
recoverBlock(BP-796319647-10.123.116.86-1490387692090:blk_1073741825_1001, 
targets=[DatanodeInfoWithStorage[/default-rack/127.0.0.1:32515,null,null], 
DatanodeInfoWithStorage[/default-rack/127.0.0.1:32362,null,null], 
DatanodeInfoWithStorage[/default-rack/127.0.0.1:32439,null,null]], 
newGenerationStamp=1006, newBlock=null, isStriped=false)
{code} 

*RECOVERY 1003 REJECTED ON THE NAMENODE*
{code}
2017-03-24 13:35:09,001 INFO  ipc.Server (Server.java:run(2200)) - IPC Server 
handler 5 on 32353, call 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
 from 127.0.0.1:32591 Call#63 Retry#0
java.io.IOException: The recovery id 1003 does not match current recovery id 
1006 for block BP-796319647-10.123.116.86-1490387692090:blk_1073741825_1001
{code}
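
The rejection above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the HDFS implementation: `RecoveryIdRace`, `ToyNameNode`, `issueRecoveryCommand` and `simulate` are hypothetical names, the starting stamp and the one-by-one increments are illustrative (the real logs show 1003 and 1006), and only the name `commitBlockSynchronization` and the wording of the error message are taken from the log.

```java
import java.io.IOException;

/** Toy model of the recovery-ID check; NOT the actual HDFS code. */
final class RecoveryIdRace {

    /** Hypothetical stand-in for the NameNode's recovery state for one block. */
    static final class ToyNameNode {
        private long generationStamp = 1001;   // illustrative starting value
        private long currentRecoveryId = -1;   // no recovery issued yet

        /** Each recovery command carries a fresh, larger recovery ID. */
        long issueRecoveryCommand() {
            currentRecoveryId = ++generationStamp;
            return currentRecoveryId;
        }

        /** Accepts a commit only for the most recently issued recovery ID. */
        void commitBlockSynchronization(long recoveryId) throws IOException {
            if (recoveryId != currentRecoveryId) {
                throw new IOException("The recovery id " + recoveryId
                        + " does not match current recovery id " + currentRecoveryId);
            }
        }
    }

    /**
     * Runs the given number of recovery attempts and returns how many commits
     * the toy NameNode accepted. When recoveryOutlastsHeartbeat is true, a new
     * recovery command is issued before the in-flight recovery can commit.
     */
    static int simulate(int rounds, boolean recoveryOutlastsHeartbeat) {
        ToyNameNode nn = new ToyNameNode();
        int accepted = 0;
        long inFlight = nn.issueRecoveryCommand();
        for (int i = 0; i < rounds; i++) {
            long next = -1;
            if (recoveryOutlastsHeartbeat) {
                // Another heartbeat arrives first, so the NN hands out a newer ID.
                next = nn.issueRecoveryCommand();
            }
            try {
                nn.commitBlockSynchronization(inFlight);
                accepted++;
            } catch (IOException e) {
                // Stale recovery ID: the commit is rejected, as in the log above.
            }
            // The DataNode then works on the newest command it has received.
            inFlight = recoveryOutlastsHeartbeat ? next : nn.issueRecoveryCommand();
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("slow recovery, accepted commits: " + simulate(5, true));
        System.out.println("fast recovery, accepted commits: " + simulate(5, false));
    }
}
```

In this model, every commit is rejected for as long as each recovery takes longer than the interval between recovery commands, because the ID being committed is always one command behind.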

> Block recovery will fail indefinitely if recovery time > heartbeat interval
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-11576
>                 URL: https://issues.apache.org/jira/browse/HDFS-11576
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs, namenode
>    Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2
>            Reporter: Lukas Majercak
>            Assignee: Lukas Majercak
>            Priority: Critical
>         Attachments: HDFS-11576.repro.patch
>
>
> Block recovery will fail indefinitely if the time to recover a block is 
> always longer than the heartbeat interval. Scenario:
> 1. DN sends heartbeat 
> 2. NN sends a recovery command to DN, recoveryID=X
> 3. DN starts recovery
> 4. DN sends another heartbeat
> 5. NN sends a recovery command to DN, recoveryID=X+1
> 6. DN finishes the first recovery and calls commitBlockSynchronization on 
> the NN, which fails because X < X+1
> ... 



