[ https://issues.apache.org/jira/browse/HDFS-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186407#comment-13186407 ]
Todd Lipcon commented on HDFS-2791:
-----------------------------------
To start brainstorming a solution, here are a few scattered thoughts:
- The basic requirement is that we should always accept a "past" valid state of
the block, since we don't have any global barrier before the block is marked
complete. So we need to support RBW-before-finalized (an RBW replica reported
for a block that has since been finalized) even if it has a too-short length,
for example. There might be other bugs similar to this if the file is
re-opened for append (e.g. a block is reported with a too-young generation
stamp racing with the re-open).
- I think this can only happen in the _first_ block report after a block's
state changes, since each block report freshly examines the DN state. So maybe
we can use the block report timestamps to our advantage somehow? I.e., if the
block changed state between the time the previous block report was received
and this one, the BR might have raced?
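As a rough sketch of the timestamp idea (not a patch: every class, enum, and
method name below is hypothetical and does not correspond to actual NameNode
code, and it only models the RBW-on-COMPLETE case), the check might look
something like this:

```java
// Hypothetical sketch only; none of these names exist in HDFS.
final class BlockReportRaceSketch {
    enum ReplicaState { RBW, FINALIZED }
    enum BlockState { UNDER_CONSTRUCTION, COMPLETE }
    enum Verdict { ACCEPT, DEFER, CORRUPT }

    /**
     * Decide what to do with one replica from a block report.
     *
     * @param reported             replica state the DN put in its report
     * @param stored               block state currently held by the NN
     * @param blockStateChangedAt  NN time at which the block last changed state
     * @param prevReportReceivedAt NN time at which the previous report from
     *                             this DN was received
     */
    static Verdict check(ReplicaState reported, BlockState stored,
                         long blockStateChangedAt, long prevReportReceivedAt) {
        boolean mismatch =
            reported == ReplicaState.RBW && stored == BlockState.COMPLETE;
        if (!mismatch) {
            // Simplification: other combinations are not modeled here.
            return Verdict.ACCEPT;
        }
        if (blockStateChangedAt > prevReportReceivedAt) {
            // First report since the state change: the report may have been
            // generated before the change, so RBW is a valid "past" state.
            return Verdict.DEFER;
        }
        // The DN has already reported once since the change and still
        // claims RBW, so this is not the race described in this issue.
        return Verdict.CORRUPT;
    }

    public static void main(String[] args) {
        // Block completed at t=200, previous report arrived at t=100.
        System.out.println(check(ReplicaState.RBW, BlockState.COMPLETE, 200L, 100L));
        // Block completed at t=50, previous report arrived at t=100.
        System.out.println(check(ReplicaState.RBW, BlockState.COMPLETE, 50L, 100L));
    }
}
```

Here DEFER would mean "don't mark corrupt yet; wait for the next block report
from this DN", which by the reasoning above can no longer have raced.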
The floor is open for creative simple solutions :)
> If block report races with closing of file, replica is incorrectly marked
> corrupt
> ---------------------------------------------------------------------------------
>
> Key: HDFS-2791
> URL: https://issues.apache.org/jira/browse/HDFS-2791
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, name-node
> Affects Versions: 0.22.0, 0.23.0
> Reporter: Todd Lipcon
>
> The following sequence of events results in a replica mistakenly marked
> corrupt:
> 1. Pipeline is open with 2 replicas
> 2. DN1 generates a block report but is slow in sending it to the NN (e.g. a
> flaky network). It gets "stuck" right before the block report RPC.
> 3. Client closes the file.
> 4. DN2 is fast and sends blockReceived to the NN. NN marks the block as
> COMPLETE.
> 5. DN1's block report proceeds, and includes the block in an RBW state.
> 6. (x) NN incorrectly marks the replica as corrupt, since it is an RBW
> replica on a COMPLETE block.