[ 
https://issues.apache.org/jira/browse/HDFS-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192837#comment-13192837
 ] 

Todd Lipcon commented on HDFS-2791:
-----------------------------------

bq. I agree that it is safer to ignore it if the gen-stamp AND length match 
rather than mark the replica as corrupt.
It's quite likely that the reported length will be shorter (e.g. because the 
client kept writing after the block report was generated).

bq. The Op_Add contains the block location (I don't think it contains the 
block location).
Right, it doesn't include the block location.

bq. the AddBlock from the DN arrives at the NNB shortly after the Block report 
containing RBW. 

Yep, but I don't think that causes it to lose its "corrupt" status. I suppose 
we could add a special flag to the corrupt block entry saying "corrupt due to 
state" rather than "corrupt due to checksum errors", and if we receive a 
correct-state reported block for a "corrupt due to state" block, we unflag the 
corruption. Does that solution seem preferable to you?
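
To make the proposal concrete, here is a minimal Java sketch of a corrupt-replica map that remembers why each replica was flagged. It is an illustration only, not the actual NameNode code: the class `CorruptReplicaSketch`, the enums `Reason` and `ReplicaState`, and all method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the "corrupt due to state" flag; not NameNode code. */
public class CorruptReplicaSketch {
    /** Why a replica was flagged corrupt (assumed names). */
    enum Reason { STATE_MISMATCH, CHECKSUM_ERROR }

    /** Simplified replica states as reported by a DataNode. */
    enum ReplicaState { RBW, FINALIZED }

    // Key: "blockId@datanode"; value: the reason the replica was flagged.
    private final Map<String, Reason> corrupt = new HashMap<>();

    private static String key(long blockId, String dn) {
        return blockId + "@" + dn;
    }

    void markCorrupt(long blockId, String dn, Reason reason) {
        corrupt.put(key(blockId, dn), reason);
    }

    /** Handles a replica reported by a DN for a block that is already COMPLETE. */
    void onReplicaReported(long blockId, String dn, ReplicaState state) {
        String k = key(blockId, dn);
        if (state == ReplicaState.FINALIZED) {
            // A correct-state report clears only a state-based flag;
            // a checksum-based flag stays in place.
            if (corrupt.get(k) == Reason.STATE_MISMATCH) {
                corrupt.remove(k);
            }
        } else {
            // RBW replica on a COMPLETE block: flag it as "corrupt due to state",
            // without overwriting an existing checksum-based flag.
            corrupt.putIfAbsent(k, Reason.STATE_MISMATCH);
        }
    }

    boolean isCorrupt(long blockId, String dn) {
        return corrupt.containsKey(key(blockId, dn));
    }
}
```

With this shape, a stale RBW report followed by a later FINALIZED report for the same replica leaves it unflagged, while a replica flagged for checksum errors stays corrupt no matter what state is reported afterwards.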

                
> If block report races with closing of file, replica is incorrectly marked 
> corrupt
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-2791
>                 URL: https://issues.apache.org/jira/browse/HDFS-2791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, name-node
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-2791-test.txt, hdfs-2791.txt, hdfs-2791.txt, 
> hdfs-2791.txt
>
>
> The following sequence of events results in a replica mistakenly marked 
> corrupt:
> 1. Pipeline is open with 2 replicas.
> 2. DN1 generates a block report but is slow in sending it to the NN (e.g. 
> due to a flaky network). It gets "stuck" right before the block report RPC.
> 3. Client closes the file.
> 4. DN2 is fast and sends blockReceived to the NN. NN marks the block as 
> COMPLETE.
> 5. DN1's block report proceeds, and includes the block in an RBW state.
> 6. (x) NN incorrectly marks the replica as corrupt, since it is an RBW 
> replica on a COMPLETE block.
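
The sequence of events above can be replayed with a toy model. This is a simplified sketch, not HDFS code: the class `BlockReportRaceSketch`, its enums, and its methods are hypothetical, and the NN is reduced to a single block state plus a set of corrupt replicas.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy replay of the race from the issue description; not HDFS code. */
public class BlockReportRaceSketch {
    enum BlockState { UNDER_CONSTRUCTION, COMPLETE }
    enum ReplicaState { RBW, FINALIZED }

    private BlockState blockState = BlockState.UNDER_CONSTRUCTION;
    private final Set<String> corruptReplicas = new HashSet<>();

    /** Step 4: after the client closes the file, a blockReceived completes the block. */
    void blockReceived(String dn) {
        blockState = BlockState.COMPLETE;
    }

    /** Steps 5-6: naive handling that treats any RBW replica of a COMPLETE block as corrupt. */
    void processReportedReplica(String dn, ReplicaState reported) {
        if (blockState == BlockState.COMPLETE && reported == ReplicaState.RBW) {
            corruptReplicas.add(dn);
        }
    }

    boolean isCorrupt(String dn) {
        return corruptReplicas.contains(dn);
    }

    public static void main(String[] args) {
        BlockReportRaceSketch nn = new BlockReportRaceSketch();
        // Step 2: DN1 snapshots its replica state while the pipeline is still open.
        ReplicaState staleSnapshot = ReplicaState.RBW;
        // Steps 3-4: the client closes the file and DN2 reports first.
        nn.blockReceived("DN2");
        // Step 5: DN1's delayed block report finally arrives with the stale state.
        nn.processReportedReplica("DN1", staleSnapshot);
        // Step 6: the healthy replica on DN1 is now flagged.
        System.out.println("DN1 corrupt? " + nn.isCorrupt("DN1"));
    }
}
```

Running `main` prints `DN1 corrupt? true`, even though DN1's replica was only stale at the time the report was generated.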


        
