[
https://issues.apache.org/jira/browse/HDFS-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191436#comment-13191436
]
Todd Lipcon commented on HDFS-2791:
-----------------------------------
I've been thinking about this over the weekend and this morning. My current
thinking is that the safest bet is the following approach:
When an RBW replica is reported for a block the NN considers finalized:
- Case 1) if the reported replica has a too-low generation stamp, mark it
corrupt.
- Case 2) if the reported replica has the correct generation stamp, ignore it
(don't add it to the block locations and don't mark it corrupt)
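The two cases above can be sketched as a small decision function. This is an illustrative sketch only; the function name, arguments, and return values are assumptions for exposition, not the actual BlockManager API:

```python
def handle_rbw_report(stored_genstamp, reported_genstamp):
    """Decide what to do when a DN reports an RBW replica for a block
    the NameNode already considers finalized/complete.
    Illustrative sketch -- not the real HDFS code."""
    if reported_genstamp < stored_genstamp:
        # Case 1: stale genstamp -> the replica missed an append or a
        # pipeline recovery; safe to mark it corrupt.
        return "MARK_CORRUPT"
    elif reported_genstamp == stored_genstamp:
        # Case 2: correct genstamp but stale RBW state -> almost
        # certainly a delayed block report; ignore it (don't add to
        # block locations, don't mark corrupt).
        return "IGNORE"
    else:
        # A genstamp newer than the NN's should not occur in this path.
        return "UNEXPECTED"
```
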
Here's the reasoning:
*Case 1* One of the DNs is reporting a stale generation stamp.
This means that the client must have either appended to the block or undergone
pipeline recovery. There are two possibilities for why the DN is reporting an
old genstamp:
- 1a) it is a "delayed block report" as described in this JIRA. We will later
see a correct/up-to-date BR for the same block.
Here it is OK to mark the block as corrupt: when we send the "invalidate"
message to the DN, we invalidate the old genstamp specifically. So when the
DN receives the invalidation, it will not delete the new (correct) replica,
but rather just ignore the stale invalidation.
- 1b) the client lost its connection to this DN and performed a pipeline
recovery before closing the file. In this case we will never see a
correct/up-to-date BR.
Here it's also OK to mark it as corrupt, because it really is corrupt (i.e.,
it did not participate in the pipeline recovery).
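The reason 1a is safe hinges on invalidations identifying a replica by both block ID and genstamp. A DN-side sketch of that matching (hypothetical names, not the actual DataNode code):

```python
def dn_handle_invalidate(local_replicas, block_id, genstamp_to_delete):
    """DataNode-side sketch: an invalidation names the replica by
    (block_id, genstamp). If the DN holds only a replica with a newer
    genstamp, the stale invalidation matches nothing and is ignored,
    so the new (correct) replica survives.
    Illustrative only -- not the real HDFS DataNode code."""
    key = (block_id, genstamp_to_delete)
    if key in local_replicas:
        del local_replicas[key]
        return "DELETED"
    return "IGNORED"  # newer replica left untouched
```
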
*Case 2* correct generation stamp, but an RBW report for a FINALIZED block
As far as I can tell, the only way we can get here is the "delayed
report" scenario described in this JIRA. The reasoning is as follows:
- in order for the client to call completeBlock(), it must have gotten a
successful pipeline close from all of the DNs in the current pipeline
- if the pipeline nodes had changed, the block would have gotten a different
generation stamp. So, all of the nodes that have a replica with the correct
genstamp were in the closed pipeline
- thus all of the nodes with the correct genstamp have the correct length
and state, and any report saying otherwise must be due to a message delay.
The only other possibility is something like a machine crash that doesn't
replay the ext3 journal, causing some blocks to be rolled back to a prior
state. In that case, upon restart, the DN would change the replica to RWR
(ReplicaWaitingToBeRecovered) and we could use the original logic of marking
it corrupt.
I think the above solution is safer and simpler than any other solutions I
could come up with.
> If block report races with closing of file, replica is incorrectly marked
> corrupt
> ---------------------------------------------------------------------------------
>
> Key: HDFS-2791
> URL: https://issues.apache.org/jira/browse/HDFS-2791
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, name-node
> Affects Versions: 0.22.0, 0.23.0
> Reporter: Todd Lipcon
> Attachments: hdfs-2791-test.txt
>
>
> The following sequence of events results in a replica mistakenly marked
> corrupt:
> 1. Pipeline is open with 2 replicas
> 2. DN1 generates a block report but is slow in sending it to the NN (e.g.,
> due to a flaky network). It gets "stuck" right before the block report RPC.
> 3. Client closes the file.
> 4. DN2 is fast and sends blockReceived to the NN. NN marks the block as
> COMPLETE
> 5. DN1's block report proceeds, and includes the block in an RBW state.
> 6. (x) NN incorrectly marks the replica as corrupt, since it is an RBW
> replica on a COMPLETE block.
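The six-step race above can be replayed in a few lines. The class and method names here are illustrative stand-ins, not real HDFS classes; the point is only the message ordering that triggers step 6:

```python
class NameNodeSketch:
    """Minimal model of the NN's buggy handling of a delayed block report.
    Illustrative only -- not the actual NameNode implementation."""

    def __init__(self):
        self.block_state = "UNDER_CONSTRUCTION"
        self.corrupt = set()

    def block_received(self, dn):
        # Step 4: DN2's blockReceived arrives; file close completes the block.
        self.block_state = "COMPLETE"

    def block_report(self, dn, replica_state):
        # Step 5: DN1's delayed report finally arrives, still showing RBW.
        if self.block_state == "COMPLETE" and replica_state == "RBW":
            # Step 6 (x): the buggy behaviour -- replica wrongly marked corrupt.
            self.corrupt.add(dn)

nn = NameNodeSketch()
nn.block_received("DN2")        # fast DN: block goes COMPLETE
nn.block_report("DN1", "RBW")   # slow DN: stale RBW report lands afterwards
# DN1's healthy replica now sits in nn.corrupt
```
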