[
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283
]
huhaiyang commented on HDFS-15798:
----------------------------------
[~sodonnell] We have encountered exceptions like this in our cluster:
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
This exception is currently caught in StripedBlockReconstructor#run --> catch(Throwable e),
and the finally block then decrements XmitsInProgress.
However, no exception log appears from ErasureCodingWorker#processErasureCodingTasks
--> catch(Throwable e).
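
To make the accounting concrete, here is a minimal toy model of the counter. It is not Hadoop code: the class XmitsCounterSketch and the methods weighted and simulate are hypothetical names used only for illustration, and the "second decrement" path is an assumption, not a confirmed diagnosis of this issue. It only shows that the counter returns to zero when each increment is matched by exactly one decrement of the same size, and goes negative as soon as a second code path decrements for the same task.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a DataNode xmits counter. Hypothetical, not Hadoop code. */
class XmitsCounterSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();

  void increment(int n) { xmitsInProgress.addAndGet(n); }
  void decrement(int n) { xmitsInProgress.addAndGet(-n); }
  int get() { return xmitsInProgress.get(); }

  /** Same formula as in the quoted code: never charge less than 1. */
  static int weighted(int xmits, float xmitWeight) {
    return Math.max((int) (xmits * xmitWeight), 1);
  }

  /**
   * Simulates one task that fails while running.
   * The task itself decrements in its finally block; if
   * submitterAlsoDecrements is true, a second path (assumed here for
   * illustration) decrements the same amount again.
   */
  static int simulate(int xmits, float xmitWeight,
      boolean submitterAlsoDecrements) {
    XmitsCounterSketch dn = new XmitsCounterSketch();
    int submitted = weighted(xmits, xmitWeight);
    dn.increment(submitted);            // charged when the task is enqueued
    try {
      throw new NullPointerException("simulated read failure");
    } catch (Throwable e) {
      // the task swallows the failure, as run() does after logging it
    } finally {
      dn.decrement(weighted(xmits, xmitWeight)); // released by the task
    }
    if (submitterAlsoDecrements) {
      dn.decrement(submitted);          // released a second time
    }
    return dn.get();
  }

  public static void main(String[] args) {
    System.out.println(simulate(6, 0.5f, false)); // prints 0
    System.out.println(simulate(6, 0.5f, true));  // prints -3
  }
}
```

In this sketch the balanced case ends at 0 and the double-release case ends at -3, which is the kind of negative value the issue title describes.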
> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has
> negative number
> --------------------------------------------------------------------------------------
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: huhaiyang
> Assignee: huhaiyang
> Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in
> processErasureCodingTasks subtracts an incorrect value.
> As a result, XmitsInProgress on the DN can become negative, which affects how
> the NN chooses pending tasks based on the ratio between the lengths of the
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
>     Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
>     int xmitsSubmitted = 0;
>     try {
>       ...
>       // It may throw IllegalArgumentException from task#stripedReader
>       // constructor.
>       final StripedBlockReconstructor task =
>           new StripedBlockReconstructor(this, stripedReconInfo);
>       if (task.hasValidTargets()) {
>         // See HDFS-12044. We increase xmitsInProgress even the task is only
>         // enqueued, so that
>         // 1) NN will not send more tasks than what DN can execute and
>         // 2) DN will not throw away reconstruction tasks, and instead keeps
>         // an unbounded number of tasks in the executor's task queue.
>         xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
>         // task start increment
>         getDatanode().incrementXmitsInProcess(xmitsSubmitted);
>         stripedReconstructionPool.submit(task);
>       } else {
>         LOG.warn("No missing internal block. Skip reconstruction for task:{}",
>             reconInfo);
>       }
>     } catch (Throwable e) {
>       // task failed decrement, XmitsInProgress is decremented by the
>       // previous value
>       getDatanode().decrementXmitsInProgress(xmitsSubmitted);
>       LOG.warn("Failed to reconstruct striped block {}",
>           reconInfo.getExtendedBlock().getLocalBlock(), e);
>     }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
>     initDecoderIfNecessary();
>     ...
>   } catch (Throwable e) {
>     LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
>     getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
>     float xmitWeight = getErasureCodingWorker().getXmitWeight();
>     // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
>     // because if it set to zero, we cannot to measure the xmits submitted
>     int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
>     // task complete decrement
>     getDatanode().decrementXmitsInProgress(xmitsSubmitted);
>     ...
>   }
> }{code}