[ 
https://issues.apache.org/jira/browse/HDFS-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2929:
------------------------------

    Attachment: hdfs-2929.txt

Actually, on second thought, given that the title of this JIRA is "test and *fixes*
for block synchronization", I'll include the bug fix here.

The bug is the following:
After a block synchronization occurs, the primary datanode (which acted as
coordinator) calls {{commitBlockSynchronization}} on the active NN. This causes
the NN to update the generation stamp on all of the replicas to the new
generation stamp and persist the new genstamp to the edit log. The standby,
however, doesn't get the new block locations at this time -- it only gets the
new genstamp through the edit log. So if we fail over again after a
synchronization, the new active won't have any correct block locations for
that block.
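To make the failure mode concrete, here is a toy model of one block's state on each NN. It is not HDFS code: `NamenodeView`, `applyEditLogGenStamp` and the genstamp values are made up for illustration. It assumes a replica location only counts as correct when its recorded genstamp matches the block's current genstamp, and that the standby learns nothing from the edit log except the new genstamp.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: these names are hypothetical and do not mirror HDFS classes.
public class BlockSyncBugModel {

    /** One NameNode's view of a single block: current genstamp plus replica locations. */
    static final class NamenodeView {
        long genStamp;
        // datanode name -> genstamp at which that replica was last recorded
        final Map<String, Long> replicaGenStamps = new HashMap<>();

        NamenodeView(long genStamp) {
            this.genStamp = genStamp;
        }

        /** A location is usable only if its genstamp matches the block's current genstamp. */
        Set<String> correctLocations() {
            Set<String> result = new HashSet<>();
            for (Map.Entry<String, Long> e : replicaGenStamps.entrySet()) {
                if (e.getValue() == genStamp) {
                    result.add(e.getKey());
                }
            }
            return result;
        }
    }

    /** Active NN: bump the genstamp and record the synchronized replicas at the new genstamp. */
    static void commitBlockSynchronization(NamenodeView active, long newGenStamp, Set<String> datanodes) {
        active.genStamp = newGenStamp;
        active.replicaGenStamps.clear();
        for (String dn : datanodes) {
            active.replicaGenStamps.put(dn, newGenStamp);
        }
    }

    /** Standby NN: tailing the edit log delivers only the new genstamp, not the locations. */
    static void applyEditLogGenStamp(NamenodeView standby, long newGenStamp) {
        standby.genStamp = newGenStamp;
    }

    public static void main(String[] args) {
        Set<String> dns = Set.of("dn1", "dn2", "dn3");
        NamenodeView active = new NamenodeView(1001);
        NamenodeView standby = new NamenodeView(1001);
        // Before recovery, both NNs know all three replicas at genstamp 1001.
        for (String dn : dns) {
            active.replicaGenStamps.put(dn, 1001L);
            standby.replicaGenStamps.put(dn, 1001L);
        }

        // Block synchronization moves the block to genstamp 1002.
        commitBlockSynchronization(active, 1002, dns);
        applyEditLogGenStamp(standby, 1002);

        System.out.println("active correct locations:  " + active.correctLocations().size());  // 3
        System.out.println("standby correct locations: " + standby.correctLocations().size()); // 0
    }
}
```

If the standby were promoted at this point, every replica it knows about would still be at the old genstamp, so it would have no usable location for the block.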

The fix is simple: when a DN updates the genstamp on a block
({{updateReplicaUnderRecovery}}), it adds the block, with its new genstamp, to
the next incremental block report. The block then gets reported to both the
active and the standby, so they both have correct information.
I think this will also improve behavior in the non-HA case: if
{{commitBlockSynchronization}} fails, the NN will have more up-to-date
information, allowing reads of the in-progress block to continue during replica
recovery.
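The same toy model can sketch the fix: the datanode queues the updated replica for its next incremental block report inside `updateReplicaUnderRecovery`, and that report goes to every NN. Again, `Datanode`, `receivedBlock`, `sendIncrementalReport` and `pendingIncrementalReport` are hypothetical stand-ins, not the real DataNode/NameNode code. The sketch also assumes the standby has already applied the genstamp edit when the report arrives; reordering between the report and the edit log is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, not the real DataNode/NameNode code.
public class IncrementalReportFixModel {

    static final class NamenodeView {
        long genStamp;
        final Map<String, Long> replicaGenStamps = new HashMap<>();

        NamenodeView(long genStamp) {
            this.genStamp = genStamp;
        }

        /** Handles one "received block" entry from a datanode's incremental report. */
        void receivedBlock(String dn, long reportedGenStamp) {
            replicaGenStamps.put(dn, reportedGenStamp);
        }

        /** A location is usable only if its genstamp matches the block's current genstamp. */
        Set<String> correctLocations() {
            Set<String> result = new HashSet<>();
            for (Map.Entry<String, Long> e : replicaGenStamps.entrySet()) {
                if (e.getValue() == genStamp) {
                    result.add(e.getKey());
                }
            }
            return result;
        }
    }

    static final class Datanode {
        final String name;
        long replicaGenStamp;
        // genstamps queued for the next incremental block report
        final List<Long> pendingIncrementalReport = new ArrayList<>();

        Datanode(String name, long replicaGenStamp) {
            this.name = name;
            this.replicaGenStamp = replicaGenStamp;
        }

        /** The fix: after bumping the replica's genstamp, queue it for the next incremental report. */
        void updateReplicaUnderRecovery(long newGenStamp) {
            replicaGenStamp = newGenStamp;
            pendingIncrementalReport.add(newGenStamp);
        }

        /** Delivers the queued entries to every NN (active and standby alike), then clears the queue. */
        void sendIncrementalReport(List<NamenodeView> namenodes) {
            for (long gs : pendingIncrementalReport) {
                for (NamenodeView nn : namenodes) {
                    nn.receivedBlock(name, gs);
                }
            }
            pendingIncrementalReport.clear();
        }
    }

    public static void main(String[] args) {
        NamenodeView active = new NamenodeView(1001);
        NamenodeView standby = new NamenodeView(1001);
        List<Datanode> dns = List.of(
                new Datanode("dn1", 1001), new Datanode("dn2", 1001), new Datanode("dn3", 1001));
        for (Datanode dn : dns) {
            active.receivedBlock(dn.name, 1001);
            standby.receivedBlock(dn.name, 1001);
        }

        // Recovery bumps each replica to genstamp 1002 and queues it for reporting.
        for (Datanode dn : dns) {
            dn.updateReplicaUnderRecovery(1002);
        }
        // Both NNs learn the new genstamp (commitBlockSynchronization / edit log);
        // locations are not updated here, only by the reports below.
        active.genStamp = 1002;
        standby.genStamp = 1002;

        for (Datanode dn : dns) {
            dn.sendIncrementalReport(List.of(active, standby));
        }
        System.out.println("active correct locations:  " + active.correctLocations().size());  // 3
        System.out.println("standby correct locations: " + standby.correctLocations().size()); // 3
    }
}
```

Because the report path is shared by both NNs, the standby no longer depends on the edit log for locations; it only needs the genstamp from there.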
                
> HA: stress test and fixes for block synchronization
> ---------------------------------------------------
>
>                 Key: HDFS-2929
>                 URL: https://issues.apache.org/jira/browse/HDFS-2929
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hdfs-2929.txt
>
>
> We have a couple of TODOs in {{commitBlockSynchronization}} and {{syncBlock}} 
> around HA. I think the current behavior may in fact be correct, but I plan to 
> write a stress test / functional test for better coverage of this area.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
