[
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969341#comment-14969341
]
Walter Su commented on HDFS-9275:
---------------------------------
I kept digging and now understand the whole sequence of events:
# While the client is writing blockGroup_0, DN1 sends a heartbeat reporting
xceiverCount=3.
# The client finishes writing blockGroup_0 and blockGroup_1.
# DN8~10 are shut down, so idx_6~8 of blockGroup_1 are missing.
# ReplicationMonitor schedules the 1st recovery for blockGroup_1. Because DN1
is busy (see previous comments), BlockPlacementPolicy chooses DN0 and DN11 as
targets.
# ErasureCodingWorker recovers idx_6 at DN0 and idx_7 at DN11. (See
getTargetIndices() to understand why.)
# Before idx_6 and idx_7 are reported, ReplicationMonitor schedules a 2nd
recovery for blockGroup_1. It chooses DN0 as the target.
# ErasureCodingWorker tries to recover idx_6 at DN0 and fails because DN0
complains that the replica already exists.
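The steps above can be replayed with a small toy model. This is only a sketch
and not Hadoop code: the class and method names are made up for illustration,
the group is assumed to have 9 internal blocks (idx_0~8), and the index
selection is assumed to be "lowest indices not yet reported live, one per
target", which is what the steps above imply about getTargetIndices().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the timeline; none of these types are Hadoop classes.
class EcRecoveryRaceSketch {
    // ASSUMPTION: 9 internal blocks per group (idx_0~8).
    static final int TOTAL_INTERNAL_BLOCKS = 9;

    // ASSUMPTION: the worker picks the lowest indices that are not reported
    // live, one per target. It has no knowledge of in-flight recoveries.
    static List<Integer> guessTargetIndices(Set<Integer> reportedLive, int numTargets) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < TOTAL_INTERNAL_BLOCKS && result.size() < numTargets; i++) {
            if (!reportedLive.contains(i)) {
                result.add(i);
            }
        }
        return result;
    }

    // Returns false when the DataNode already holds that internal block,
    // which stands in for the "replica exists" complaint.
    static boolean writeReplica(Map<String, Set<Integer>> onDisk, String dn, int idx) {
        return onDisk.computeIfAbsent(dn, k -> new HashSet<>()).add(idx);
    }

    // Runs both recoveries and returns the outcome of each replica write.
    static List<Boolean> simulate() {
        Map<String, Set<Integer>> onDisk = new HashMap<>();
        // DN8~10 are down, so only idx_0~5 are reported live.
        Set<Integer> reportedLive = new HashSet<>(Arrays.asList(0, 1, 2, 3, 4, 5));
        List<Boolean> outcomes = new ArrayList<>();

        // 1st recovery: targets DN0 and DN11 get idx_6 and idx_7.
        List<String> firstTargets = Arrays.asList("DN0", "DN11");
        List<Integer> firstIdx = guessTargetIndices(reportedLive, firstTargets.size());
        for (int i = 0; i < firstIdx.size(); i++) {
            outcomes.add(writeReplica(onDisk, firstTargets.get(i), firstIdx.get(i)));
        }

        // idx_6 and idx_7 are NOT reported yet, so reportedLive is unchanged.

        // 2nd recovery: DN0 is chosen again and the guess is idx_6 again.
        List<String> secondTargets = Arrays.asList("DN0");
        List<Integer> secondIdx = guessTargetIndices(reportedLive, secondTargets.size());
        for (int i = 0; i < secondIdx.size(); i++) {
            outcomes.add(writeReplica(onDisk, secondTargets.get(i), secondIdx.get(i)));
        }
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints [true, true, false]
    }
}
```

The last entry is the failed write: the second task recomputes idx_6 for DN0
because nothing records that idx_6 is already being recovered there.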
A delayed heartbeat is the direct cause of the failed tests. The deeper cause
is not in the test code. It lies in how 2 concurrent EC recovery tasks are
handled:
# Defect in ReplicationMonitor. It shouldn't choose the same DataNode as a
target twice for the same block.
# Defect in ErasureCodingWorker. It doesn't know which internal blocks are
being recovered or have already been recovered. It purely guesses from the
live nodes.
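One possible direction for both defects is to remember what is already in
flight per block group. The sketch below is hypothetical and is not
necessarily what the attached patches do: PendingEcRecoverySketch and its
methods are invented names, not Hadoop APIs, and the 9-block group size is
passed in by the caller as an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical bookkeeping for in-flight EC recoveries; not a Hadoop class.
class PendingEcRecoverySketch {
    // Block group -> DataNodes already chosen as targets and not yet reported.
    private final Map<String, Set<String>> pendingTargets = new HashMap<>();
    // Block group -> internal block indices currently being recovered.
    private final Map<String, Set<Integer>> pendingIndices = new HashMap<>();

    // Picks up to `wanted` candidates, skipping DataNodes that are already
    // a pending target for this block group (first defect).
    List<String> chooseTargets(String blockGroup, List<String> candidates, int wanted) {
        Set<String> busy = pendingTargets.computeIfAbsent(blockGroup, k -> new HashSet<>());
        List<String> chosen = new ArrayList<>();
        for (String dn : candidates) {
            if (chosen.size() == wanted) {
                break;
            }
            if (busy.add(dn)) {
                chosen.add(dn);
            }
        }
        return chosen;
    }

    // Picks up to `numTargets` indices that are neither reported live nor
    // already being recovered (second defect).
    List<Integer> chooseIndices(String blockGroup, Set<Integer> reportedLive,
                                int totalBlocks, int numTargets) {
        Set<Integer> inFlight = pendingIndices.computeIfAbsent(blockGroup, k -> new HashSet<>());
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < totalBlocks && chosen.size() < numTargets; i++) {
            if (!reportedLive.contains(i) && inFlight.add(i)) {
                chosen.add(i);
            }
        }
        return chosen;
    }

    // Called once a recovered internal block has been reported.
    void reported(String blockGroup, String dn, int idx) {
        Set<String> targets = pendingTargets.get(blockGroup);
        if (targets != null) {
            targets.remove(dn);
        }
        Set<Integer> indices = pendingIndices.get(blockGroup);
        if (indices != null) {
            indices.remove(idx);
        }
    }
}
```

With this bookkeeping, a 2nd recovery scheduled before idx_6 and idx_7 are
reported would skip DN0 and DN11 and would be assigned idx_8 instead of idx_6.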
> Fix TestRecoverStripedFile
> --------------------------
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: test
> Reporter: Walter Su
> Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)