[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

Walter Su (JIRA) Thu, 22 Oct 2015 08:55:43 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969341#comment-14969341
 ]


Walter Su commented on HDFS-9275:
---------------------------------

I kept digging, and now I understand the whole sequence of steps:

# While the client is writing blockGroup_0, DN1 sends a heartbeat with xceiverCount=3.
# The client finishes writing blockGroup_0 and blockGroup_1.
# DN8~10 are shut down, so idx_6~8 of blockGroup_1 are missing.
# ReplicationMonitor schedules the 1st recovery for blockGroup_1. Because DN1 is 
busy (see previous comments), BlockPlacementPolicy chooses DN0 and DN11 as targets.
# ErasureCodingWorker recovers idx_6 at DN0 and idx_7 at DN11. (See 
getTargetIndices() to understand why.)
# Before idx_6 and idx_7 are reported, ReplicationMonitor schedules a 2nd recovery for 
blockGroup_1. It chooses DN0 as the target.
# ErasureCodingWorker tries to recover idx_6 at DN0 and fails because DN0 
complains that the replica already exists.
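
The sequence above can be reproduced in a toy model. This is only a sketch in Python, not Hadoop code: every function name is hypothetical, and it assumes the worker assigns the lowest missing indices to the chosen targets in order, judging "missing" purely from the replicas reported so far.

```python
# Toy model of the race; all names are hypothetical, not Hadoop APIs.
DATA_BLOCKS, PARITY_BLOCKS = 6, 3


def missing_indices(reported):
    """Indices believed missing, judged only from reported replicas."""
    total = DATA_BLOCKS + PARITY_BLOCKS
    return [i for i in range(total) if i not in reported]


def schedule(reported, targets):
    """Pair each target with the lowest missing index (assumed behaviour)."""
    return list(zip(missing_indices(reported), targets))


def run_recovery(tasks, replicas_on_node):
    """Apply tasks; a node rejects an index it already stores."""
    failures = []
    for idx, node in tasks:
        stored = replicas_on_node.setdefault(node, set())
        if idx in stored:
            failures.append((idx, node))  # "replica exists"
        else:
            stored.add(idx)
    return failures


# idx_0..5 are reported; idx_6..8 went missing with DN8~10.
reported = set(range(6))
replicas_on_node = {}

first = schedule(reported, ["DN0", "DN11"])
first_failures = run_recovery(first, replicas_on_node)

# The recovered replicas are not reported yet, so `reported` is unchanged
# and the 2nd task is computed from the same stale view.
second = schedule(reported, ["DN0"])
second_failures = run_recovery(second, replicas_on_node)

print(first, second, second_failures)
```

In this model the second task targets idx_6 at DN0 again, and DN0 rejects it because it already holds that replica.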

A delayed heartbeat is the direct cause of the failed tests. The deeper cause 
is not in the test code; it lies in defects in the handling of 2 
concurrent EC recovery tasks:
# Defect in ReplicationMonitor. It shouldn't choose the same DataNode as a target 
twice for the same block.
# Defect in ErasureCodingWorker. It doesn't know which internal blocks are being 
recovered or have already been recovered. It purely guesses from the live nodes.
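
One possible direction for addressing both defects is to track in-flight recoveries per block group, so that a second task skips indices and DataNodes already in use. The following is a minimal Python sketch of that idea, not the approach taken in the attached patches. The class `PendingRecoveryTracker`, its methods, and the extra node `DN12` are all hypothetical.

```python
from collections import defaultdict


class PendingRecoveryTracker:
    """Hypothetical bookkeeping of in-flight recoveries per block group."""

    def __init__(self):
        # block group -> {internal block index: target node}
        self._pending = defaultdict(dict)

    def pick(self, group, missing, candidates):
        """Return (index, node) pairs, skipping anything already in flight."""
        in_flight = self._pending[group]
        free_indices = [i for i in missing if i not in in_flight]
        busy_nodes = set(in_flight.values())
        free_nodes = [n for n in candidates if n not in busy_nodes]
        tasks = list(zip(free_indices, free_nodes))
        in_flight.update(tasks)  # remember what was just scheduled
        return tasks

    def reported(self, group, index):
        """Forget an index once its recovered replica has been reported."""
        self._pending[group].pop(index, None)


tracker = PendingRecoveryTracker()
# 1st recovery: idx_6 and idx_7 go to DN0 and DN11.
first = tracker.pick("blockGroup_1", [6, 7, 8], ["DN0", "DN11"])
# 2nd recovery before any report: DN0 and idx_6, idx_7 are skipped.
second = tracker.pick("blockGroup_1", [6, 7, 8], ["DN0", "DN12"])
print(first, second)
```

With this bookkeeping the second task recovers only idx_8 on a node that is not already a target, instead of retrying idx_6 at DN0.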

> Fix TestRecoverStripedFile
> --------------------------
>
>                 Key: HDFS-9275
>                 URL: https://issues.apache.org/jira/browse/HDFS-9275
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Walter Su
>            Assignee: Walter Su
>         Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
