[ 
https://issues.apache.org/jira/browse/HDFS-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192775#comment-13192775
 ] 

Todd Lipcon commented on HDFS-2791:
-----------------------------------

bq. Can you explain why the situation is different for HA?

Yes. Imagine the following sequence, with two NNs, NNA (active) and 
NNB (standby):

1) A file with replication count 1 is being written.
2) The DN generates a block report with the replica in RBW state and starts 
sending it to NNB. The network has a problem and this block report doesn't 
arrive for a while (e.g. the DN is temporarily partitioned from NNB, so the DN 
keeps retrying for minutes).
3) The file is closed.
4) NNB tails the edits, sees OP_ADD and OP_CLOSE, and marks the block as 
COMPLETE, even though it hasn't seen any replicas yet.
5) The network partition is resolved, and NNB receives the block report with 
an RBW replica, mistakenly marking it as corrupt.
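
The sequence above can be replayed with a toy model. This is only a sketch to illustrate the ordering problem: ToyStandbyNN, apply_edit, process_block_report and replay_race are hypothetical names invented here, not NameNode classes or methods, and the corruption rule is the simplified "RBW replica on a COMPLETE block" check described in this issue.

```python
from enum import Enum


class BlockState(Enum):
    UNDER_CONSTRUCTION = "UNDER_CONSTRUCTION"
    COMPLETE = "COMPLETE"


class ReplicaState(Enum):
    RBW = "RBW"              # replica being written
    FINALIZED = "FINALIZED"


class ToyStandbyNN:
    """Hypothetical stand-in for NNB; not the real NameNode code."""

    def __init__(self):
        self.block_state = {}   # block id -> BlockState
        self.corrupt = set()    # (block id, datanode) pairs flagged corrupt

    def apply_edit(self, op, block_id):
        # Tailing the edit log: OP_ADD opens the block, OP_CLOSE completes
        # it, whether or not any replica has been reported yet.
        if op == "OP_ADD":
            self.block_state[block_id] = BlockState.UNDER_CONSTRUCTION
        elif op == "OP_CLOSE":
            self.block_state[block_id] = BlockState.COMPLETE

    def process_block_report(self, datanode, block_id, replica_state):
        # Simplified rule: an RBW replica reported for a COMPLETE block
        # is treated as corrupt.
        if (self.block_state.get(block_id) is BlockState.COMPLETE
                and replica_state is ReplicaState.RBW):
            self.corrupt.add((block_id, datanode))


def replay_race():
    nnb = ToyStandbyNN()
    # Step 2: the DN builds its report while the replica is still RBW,
    # but the report is stuck behind the network partition.
    delayed_report = ("dn1", "blk_1", ReplicaState.RBW)
    # Steps 3-4: the file is closed and NNB tails both edits.
    nnb.apply_edit("OP_ADD", "blk_1")
    nnb.apply_edit("OP_CLOSE", "blk_1")
    # Step 5: the partition heals and the stale report finally arrives.
    nnb.process_block_report(*delayed_report)
    return nnb
```

Calling replay_race() leaves the only replica of blk_1 in the corrupt set, even though nothing is wrong with it on the DN: the report was simply generated before the close and delivered after it.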
                
> If block report races with closing of file, replica is incorrectly marked 
> corrupt
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-2791
>                 URL: https://issues.apache.org/jira/browse/HDFS-2791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, name-node
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-2791-test.txt, hdfs-2791.txt, hdfs-2791.txt, 
> hdfs-2791.txt
>
>
> The following sequence of events results in a replica mistakenly marked 
> corrupt:
> 1. Pipeline is open with 2 replicas
> 2. DN1 generates a block report but is slow in sending it to the NN (e.g. 
> because of a flaky network). It gets "stuck" right before the block report RPC.
> 3. Client closes the file.
> 4. DN2 is fast and sends blockReceived to the NN. The NN marks the block as 
> COMPLETE.
> 5. DN1's block report proceeds, and includes the block in an RBW state.
> 6. (x) NN incorrectly marks the replica as corrupt, since it is an RBW 
> replica on a COMPLETE block.


        
