[
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273817#comment-17273817
]
Stephen O'Donnell commented on HDFS-15798:
------------------------------------------
Thanks for finding and submitting a patch for this issue. It is a good find.
Could you make one small change? The variable `int xmitsSubmitted = 0;` on line
124 no longer needs to be defined outside the try block (it is not used in the
exception any more). Could you remove that line and just define the variable
where it is set:
{code}
int xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
{code}
If I understand this correctly, this problem can only occur if there are
several tasks to process in the loop:
1. First pass around the loop, sets xmitsSubmitted = X, say 5.
2. This is used to increment the DN XmitsInProgress.
3. Next pass around the loop, the exception is thrown. As xmitsSubmitted was
never reset to zero, the DN XmitsInProgress is decremented by the previous
value from the first pass (5 in this example).
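The three steps above can be modelled with a toy loop. This is only a sketch of the scenario as described, assuming the counter is declared once and never reset between passes; `StaleXmitsDemo` and `runWithStaleCounter` are made-up names, not Hadoop code, and a negative entry is simply a stand-in for a task whose construction throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: hypothetical names, not the Hadoop classes.
class StaleXmitsDemo {

  /**
   * Simulates the loop with the counter declared once, outside the loop.
   * A non-negative entry is a task that is submitted and later finishes;
   * a negative entry models a task whose construction throws.
   * Returns the final value of the in-progress counter.
   */
  static int runWithStaleCounter(int... xmitsPerTask) {
    AtomicInteger xmitsInProgress = new AtomicInteger(0);
    int xmitsSubmitted = 0; // assumption: never reset between passes
    for (int xmits : xmitsPerTask) {
      try {
        if (xmits < 0) {
          throw new IllegalArgumentException("task construction failed");
        }
        xmitsSubmitted = xmits;
        // Increment when the task is submitted.
        xmitsInProgress.addAndGet(xmitsSubmitted);
        // The task's own finally block decrements when it ends.
        xmitsInProgress.addAndGet(-xmitsSubmitted);
      } catch (IllegalArgumentException e) {
        // Decrements by the stale value left over from the previous pass.
        xmitsInProgress.addAndGet(-xmitsSubmitted);
      }
    }
    return xmitsInProgress.get();
  }

  public static void main(String[] args) {
    // One successful task with 5 xmits, then one that fails to construct.
    System.out.println(runWithStaleCounter(5, -1)); // prints -5
  }
}
```

In this model the counter only goes negative when a failing task follows a successful one; a failing task on its own decrements by zero.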
However, I believe your fix is correct: processErasureCodingTasks(...) should
only increment the xmits when it submits a task, and it is then the
responsibility of the StripedBlockReconstructor(...) task to decrement them again.
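A minimal sketch of that division of responsibility follows. It is not the patch itself: `XmitsOwnershipSketch`, `weightedXmits` and `process` are hypothetical names, a negative entry stands in for a task whose construction throws, and the sketch does not handle a rejected submission.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: hypothetical names, not the Hadoop classes.
class XmitsOwnershipSketch {

  /** Same clamp as in the snippet above: never count fewer than 1 xmit. */
  static int weightedXmits(int xmits, float xmitWeight) {
    return Math.max((int) (xmits * xmitWeight), 1);
  }

  /**
   * The submitting loop only increments; each task decrements in its own
   * finally block. Returns the counter after all tasks have finished.
   */
  static int process(int[] taskXmits, float xmitWeight) {
    AtomicInteger xmitsInProgress = new AtomicInteger(0);
    ExecutorService pool = Executors.newFixedThreadPool(2);
    for (int xmits : taskXmits) {
      try {
        if (xmits < 0) {
          throw new IllegalArgumentException("task construction failed");
        }
        // Defined where it is set, so no stale value can leak between passes.
        final int xmitsSubmitted = weightedXmits(xmits, xmitWeight);
        xmitsInProgress.addAndGet(xmitsSubmitted);
        pool.submit(() -> {
          try {
            // Reconstruction work would go here.
          } finally {
            xmitsInProgress.addAndGet(-xmitsSubmitted);
          }
        });
      } catch (IllegalArgumentException e) {
        // Nothing was incremented for this task, so there is nothing to undo.
      }
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return xmitsInProgress.get();
  }

  public static void main(String[] args) {
    // Two good tasks around one that fails to construct.
    System.out.println(process(new int[] {5, -1, 3}, 1.0f)); // prints 0
  }
}
```

Because only the task that was actually submitted ever decrements, each increment is matched by exactly one decrement and the counter returns to zero.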
> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will
> be performed twice
> -----------------------------------------------------------------------------------------------
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: huhaiyang
> Assignee: huhaiyang
> Priority: Major
> Attachments: HDFS-15798.001.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress operation
> is performed twice.
> As a result, the DN's XmitsInProgress can become negative, which affects how
> the NN chooses pending tasks based on the ratio between the lengths of the
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
> for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
> ...
> // It may throw IllegalArgumentException from task#stripedReader
> // constructor.
> final StripedBlockReconstructor task =
> new StripedBlockReconstructor(this, stripedReconInfo);
> if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> // 1) NN will not send more tasks than what DN can execute and
> // 2) DN will not throw away reconstruction tasks, and instead keeps
> // an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1. task start increment
> stripedReconstructionPool.submit(task);
> } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
> }
> } catch (Throwable e) {
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement
> LOG.warn("Failed to reconstruct striped block {}",
> reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
> }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
> try {
> initDecoderIfNecessary();
> ...
> } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
> } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed decrement
> ...
> }
> }{code}