[
https://issues.apache.org/jira/browse/HDFS-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069077#comment-16069077
]
Andrew Wang commented on HDFS-12044:
------------------------------------
Thanks for the explanation, Eddy, very helpful. I refreshed myself on the block
reconstruction process. A review for myself and other watchers:
* BlockManager computes reconstruction work and places tasks in the
DatanodeDescriptor's queue, and also tracks them in PendingReconstruction.
* handleHeartbeat polls the DD queue and hands tasks to the DN on every
heartbeat, up to {{maxReplicationStreams - xmitsInProgress}} (see the sketch
after this list).
* PendingReconstruction retries timed-out tasks after 5 minutes.
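To make the interplay concrete, here is a minimal, self-contained sketch of the
per-heartbeat handoff. The class and field names are illustrative only, not the
actual NN code:
{code}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Simplified sketch of the NN's per-heartbeat task handoff (illustrative names only). */
class HeartbeatHandoffSketch {
  static final int MAX_REPLICATION_STREAMS = 20;  // dfs.namenode.replication.max-streams

  /** Reconstruction tasks staged on the DatanodeDescriptor by BlockManager. */
  final Queue<String> queuedTasks = new ArrayDeque<>();

  /** Hands out at most (maxReplicationStreams - xmitsInProgress) tasks per heartbeat. */
  List<String> handleHeartbeat(int xmitsInProgress) {
    int maxTransfer = MAX_REPLICATION_STREAMS - xmitsInProgress;
    List<String> commands = new ArrayList<>();
    while (commands.size() < maxTransfer && !queuedTasks.isEmpty()) {
      commands.add(queuedTasks.poll());
    }
    return commands;
  }

  public static void main(String[] args) {
    HeartbeatHandoffSketch nn = new HeartbeatHandoffSketch();
    for (int i = 0; i < 30; i++) {
      nn.queuedTasks.add("reconstruct-block-" + i);
    }
    // With 2 xmits already in progress, one heartbeat hands out at most 18 tasks.
    System.out.println(nn.handleHeartbeat(2).size());  // prints 18
  }
}
{code}
If {{xmitsInProgress}} never reflects the queued EC work, {{maxTransfer}} stays
large on every heartbeat, which is exactly the problem described below.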
The core of the current issue is that the DN refuses work from the NN once the
assigned work exceeds numReconstructionThreads, while the NN keeps assigning
more work because it thinks the DN still has xmit capacity.
I think the basic fix here is to increment xmitsInProgress even for queued
reconstruction work. That way {{maxReplicationStreams - xmitsInProgress}} will
eventually reach <= 0 and the NN will stop handing out more work; otherwise the
queue will never converge.
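A minimal sketch of what that accounting could look like on the DN side,
assuming the counter is bumped at submission time rather than when the task
actually starts running (the method names are illustrative):
{code}
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: count queued EC reconstruction work as xmits so the NN's quota converges. */
class XmitAccountingSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();

  /** Increment as soon as a reconstruction task is accepted, even if it only sits in the queue. */
  void onTaskQueued() {
    xmitsInProgress.incrementAndGet();
  }

  /** Decrement when the task completes or is dropped, releasing the xmit slot. */
  void onTaskFinished() {
    xmitsInProgress.decrementAndGet();
  }

  /** Reported to the NN in the heartbeat; drives maxReplicationStreams - xmitsInProgress. */
  int getXmitsInProgress() {
    return xmitsInProgress.get();
  }

  public static void main(String[] args) {
    XmitAccountingSketch dn = new XmitAccountingSketch();
    for (int i = 0; i < 20; i++) {
      dn.onTaskQueued();              // 20 tasks accepted, most of them just queued
    }
    System.out.println(20 - dn.getXmitsInProgress());  // remaining NN quota: 0
  }
}
{code}
With 20 tasks queued or running, the reported value reaches 20, {{maxTransfer}}
drops to 0, and the NN assigns no new work until some of it drains.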
I also wonder about the relative weights of EC and replicated reconstruction.
Twenty EC reconstruction tasks represent a very different amount of work than
twenty re-replication tasks. We should be counting each block reader in
StripedReconstructor as its own xmit, i.e. an RS(10,4) recovery task would count
as 10 xmits. I looked at this in HDFS-11023 and thought it was accounted for
properly, but looking again, it is not.
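A sketch of the weighting I have in mind, where an EC task charges one xmit per
source block reader it opens (the helper names are hypothetical):
{code}
/** Sketch: weight reconstruction tasks by the number of streams they actually open. */
class XmitWeightSketch {
  /** A plain re-replication copies from a single source, so it costs one xmit. */
  static int replicationXmits() {
    return 1;
  }

  /** An EC reconstruction opens one reader per source block it decodes from. */
  static int ecReconstructionXmits(int numSourceReaders) {
    return numSourceReaders;
  }

  public static void main(String[] args) {
    // Two queued RS(10,4) recoveries consume as much xmit budget as 20 re-replications.
    int budget = 20;                                // maxReplicationStreams
    int used = 2 * ecReconstructionXmits(10);       // 10 readers per RS(10,4) task
    System.out.println("remaining budget = " + (budget - used));  // prints 0
  }
}
{code}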
More generally, I think HDFS-11023 is still worth revisiting. The NN throttles
are coarse and only operate on the heartbeat interval. Ideally the DN would also
have byte-based throttles, like the balancer's bandwidth settings, which would
be more user-friendly.
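For reference, a rough sketch of what a balancer-style byte-based throttle could
look like on the DN. This is not the existing HDFS throttler; it naively
measures from construction time, and the names are assumptions:
{code}
/** Sketch of a byte-based throttle: block callers so throughput stays under bytesPerSec. */
class ByteThrottleSketch {
  private final long bytesPerSec;
  private final long start = System.currentTimeMillis();
  private long bytesSent = 0;

  ByteThrottleSketch(long bytesPerSec) {
    this.bytesPerSec = bytesPerSec;
  }

  /** Call around each transfer of numBytes; sleeps if we are ahead of the target rate. */
  synchronized void throttle(long numBytes) throws InterruptedException {
    bytesSent += numBytes;
    long allowedElapsedMs = bytesSent * 1000 / bytesPerSec;  // time this much data "should" take
    long actualElapsedMs = System.currentTimeMillis() - start;
    if (allowedElapsedMs > actualElapsedMs) {
      Thread.sleep(allowedElapsedMs - actualElapsedMs);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ByteThrottleSketch throttle = new ByteThrottleSketch(10L * 1024 * 1024);  // 10 MB/s
    for (int i = 0; i < 20; i++) {
      throttle.throttle(1024 * 1024);  // "send" 1 MB, sleeping as needed to hold ~10 MB/s
    }
  }
}
{code}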
> Mismatch between BlockManager#maxReplicationStreams and
> ErasureCodingWorker.stripedReconstructionPool pool size causes slow and bursty
> recovery.
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12044
> URL: https://issues.apache.org/jira/browse/HDFS-12044
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding
> Affects Versions: 3.0.0-alpha3
> Reporter: Lei (Eddy) Xu
> Assignee: Lei (Eddy) Xu
> Attachments: HDFS-12044.00.patch, HDFS-12044.01.patch,
> HDFS-12044.02.patch
>
>
> {{ErasureCodingWorker#stripedReconstructionPool}} defaults to {{corePoolSize=2}}
> and {{maxPoolSize=8}}, and it rejects additional tasks once its queue is full.
> Problems arise when {{BlockManager#maxReplicationStreams}} is larger than the
> {{ErasureCodingWorker#stripedReconstructionPool}} {{corePoolSize}}/{{maxPoolSize}},
> for example {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}.
> The NN sends up to {{maxTransfer}} reconstruction tasks to the DN on each
> heartbeat, where {{maxTransfer}} is calculated in {{FSNamesystem}}:
> {code}
> final int maxTransfer = blockManager.getMaxReplicationStreams() -
> xmitsInProgress;
> {code}
> However, at any given time,
> {{ErasureCodingWorker#stripedReconstructionPool}} accounts for only 2 {{xmitsInProgress}}.
> So on each 3s heartbeat, the NN sends about {{20 - 2 = 18}} reconstruction
> tasks to the DN, and the DN throws most of them away if 8 tasks are already in
> its queue. The NN then has to wait longer to re-consider those blocks as
> under-replicated and schedule new tasks.
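To see the rejection behavior concretely, here is a standalone sketch using the
quoted defaults; the bounded queue capacity of 8 and the task duration are
illustrative assumptions, not the DN's actual settings:
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Sketch: a small reconstruction pool fed 18 tasks in a single heartbeat. */
public class PoolMismatchSketch {
  public static void main(String[] args) {
    // corePoolSize=2, maxPoolSize=8 as in the quoted defaults; the bounded
    // queue of 8 is an assumption for illustration.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(8));

    int accepted = 0, rejected = 0;
    for (int i = 0; i < 18; i++) {                 // maxReplicationStreams - xmits = 20 - 2
      try {
        pool.execute(() -> sleepQuietly(10_000));  // stand-in for a long reconstruction
        accepted++;
      } catch (RejectedExecutionException e) {
        rejected++;                                // falls back to the NN's pending-retry path
      }
    }
    // With these settings: 2 core threads + 8 queued + 6 extra threads = 16 accepted, 2 rejected.
    System.out.println("accepted=" + accepted + ", rejected=" + rejected);
    pool.shutdownNow();
  }

  private static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}
Either way the DN cannot absorb the full batch, and anything it drops sits in
PendingReconstruction until the timeout fires.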