[
https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186017#comment-15186017
]
Zhe Zhang commented on HDFS-9822:
---------------------------------
Interesting case here. I think it happens when 1) an under-replicated striped
block is put into a queue; 2) the block group loses another internal block; 3)
we then put another entry in a higher-priority queue for the same block group.
To answer Kai's question: I don't think we should schedule all EC tasks in the
same queue. If a block group has lost 3 internal blocks, we should treat it
with higher priority than one that has lost 1.
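To make the priority argument concrete, here is a minimal sketch of ranking striped block groups by how many more losses they can survive. The class and method names are hypothetical, and the RS(6,3) layout (6 data + 3 parity internal blocks) in the usage note is assumed for illustration only; this is not the priority computation in the patch.

```java
// Hypothetical helper, not part of HDFS: ranks a striped block group by the
// number of additional internal-block losses it can still tolerate.
final class StripedUrgencySketch {
  private StripedUrgencySketch() {}

  // Returns how many more internal blocks can be lost before data is
  // unrecoverable. A lower value means a more urgent reconstruction;
  // a negative value means the group has already lost too many blocks.
  static int remainingTolerance(int liveInternalBlocks, int dataBlocks,
      int parityBlocks) {
    int total = dataBlocks + parityBlocks;
    if (liveInternalBlocks < 0 || liveInternalBlocks > total) {
      throw new IllegalArgumentException(
          "liveInternalBlocks out of range: " + liveInternalBlocks);
    }
    int lost = total - liveInternalBlocks;
    return parityBlocks - lost;
  }
}
```

Under the assumed RS(6,3) layout, a group that has lost 3 internal blocks has a remaining tolerance of 0, while a group that has lost 1 has a remaining tolerance of 2, so the first would be scheduled ahead of the second.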
Thanks for the fix [~rakeshr]. The logic below only checks the {{oldPri}}
queue. I guess the block could be in other queues as well?
{code}
else if (block.isStriped()
&& !priorityQueues.get(oldPri).contains(block)) {
{code}
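One way to address the concern about the check quoted above is to look across every priority level instead of only {{oldPri}}. The following is a rough sketch with a hypothetical stand-in class that uses strings as block identifiers; it is not the real queue implementation and only illustrates the idea of keeping at most one entry per block group.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the reconstruction queues; not the HDFS class.
class PriorityQueuesSketch {
  private final List<Set<String>> priorityQueues = new ArrayList<>();

  PriorityQueuesSketch(int levels) {
    for (int i = 0; i < levels; i++) {
      priorityQueues.add(new LinkedHashSet<>());
    }
  }

  // Returns the priority level holding the block, or -1 if it is not queued.
  int findPriority(String block) {
    for (int pri = 0; pri < priorityQueues.size(); pri++) {
      if (priorityQueues.get(pri).contains(block)) {
        return pri;
      }
    }
    return -1;
  }

  // Queues the block at newPri, first dropping any entry at another level,
  // so a block group never has entries in two queues at once.
  void addOrMove(String block, int newPri) {
    int current = findPriority(block);
    if (current >= 0 && current != newPri) {
      priorityQueues.get(current).remove(block);
    }
    priorityQueues.get(newPri).add(block);
  }

  // Counts the levels that currently hold an entry for the block.
  int countEntries(String block) {
    int count = 0;
    for (Set<String> queue : priorityQueues) {
      if (queue.contains(block)) {
        count++;
      }
    }
    return count;
  }
}
```

With this sketch, queuing the same block group at level 2 and then at level 0 leaves a single entry at level 0, which is the situation the scenario in steps 1) to 3) fails to guarantee.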
> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped
> block at the same time
> ----------------------------------------------------------------------------------------------------
>
> Key: HDFS-9822
> URL: https://issues.apache.org/jira/browse/HDFS-9822
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Rakesh R
> Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch
>
>
> Found the following AssertionError in
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
> at java.lang.Thread.run(Thread.java:745)
> at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
> at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
> at java.lang.Thread.run(Thread.java:745)
> {code}