[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276304#comment-17276304
 ] 

Stephen O'Donnell commented on HDFS-15798:
------------------------------------------

[~haiyang Hu] Thanks for the update. So this bug has not happened in production 
use, but it could occur if a runtime exception were thrown.

I think it makes sense to make this change, but I wonder whether we need to 
change the test as part of it. The test does not exercise the part of the code 
we are changing. I would be happy for this change to go in without any test 
updates, as it is not going to be possible to test it effectively, and the 
change makes sense from reviewing the code.

What do you think, [~ferhui]?

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-15798
>                 URL: https://issues.apache.org/jira/browse/HDFS-15798
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: huhaiyang
>            Assignee: huhaiyang
>            Priority: Major
>         Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can decrement by an incorrect value; 
> the DN's XmitsInProgress can then become negative, which affects how the NN 
> chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
>     Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
>     int xmitsSubmitted = 0;
>     try {
>       ...
>       // It may throw IllegalArgumentException from task#stripedReader
>       // constructor.
>       final StripedBlockReconstructor task =
>           new StripedBlockReconstructor(this, stripedReconInfo);
>       if (task.hasValidTargets()) {
>         // See HDFS-12044. We increase xmitsInProgress even the task is only
>         // enqueued, so that
>         //   1) NN will not send more tasks than what DN can execute and
>         //   2) DN will not throw away reconstruction tasks, and instead keeps
>         //      an unbounded number of tasks in the executor's task queue.
>         xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
>         getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment
>         stripedReconstructionPool.submit(task);
>       } else {
>         LOG.warn("No missing internal block. Skip reconstruction for task:{}",
>             reconInfo);
>       }
>     } catch (Throwable e) {
>       // Task failed: decrement XmitsInProgress by the previously incremented value.
>       getDatanode().decrementXmitsInProgress(xmitsSubmitted);
>       LOG.warn("Failed to reconstruct striped block {}",
>           reconInfo.getExtendedBlock().getLocalBlock(), e);
>     }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
>     initDecoderIfNecessary();
>    ...
>   } catch (Throwable e) {
>     LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
>     getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
>     float xmitWeight = getErasureCodingWorker().getXmitWeight();
>     // If the xmits is smaller than 1, xmitsSubmitted should be set to 1,
>     // because if it were set to zero, we could not measure the xmits submitted.
>     int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
>     getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete decrement
>     ...
>   }
> }{code}
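To illustrate the pairing rule behind the fix, here is a minimal, self-contained sketch (not HDFS code; `XmitCounter` and `submitTask` are hypothetical names): the error path decrements only by the amount actually incremented for that task, so a failure before the increment decrements by zero and the counter can never go negative.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the DN's xmitsInProgress counter.
class XmitCounter {
    private final AtomicInteger xmitsInProgress = new AtomicInteger();

    // Simulates one iteration of a submit loop: the increment happens
    // only after the task object is built, so the catch block must
    // decrement exactly what this iteration incremented.
    int submitTask(boolean failBeforeIncrement) {
        int xmitsSubmitted = 0; // per-task, never carried across iterations
        try {
            if (failBeforeIncrement) {
                // e.g. an IllegalArgumentException thrown while
                // constructing the reconstruction task
                throw new IllegalArgumentException("bad reconstruction info");
            }
            xmitsSubmitted = 1;
            xmitsInProgress.addAndGet(xmitsSubmitted);
        } catch (Throwable e) {
            // Decrements 0 when the failure happened before the increment,
            // so the counter never drops below its true value.
            xmitsInProgress.addAndGet(-xmitsSubmitted);
        }
        return xmitsInProgress.get();
    }
}
```

With this pattern, a failed submission leaves the counter at its previous value instead of driving it negative.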



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
