[ https://issues.apache.org/jira/browse/HDFS-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069077#comment-16069077 ]

Andrew Wang commented on HDFS-12044:
------------------------------------

Thanks for the explanation, Eddy, very helpful. I refreshed myself on the block 
reconstruction process. A review for myself and other watchers:

* BlockManager calculates reconstruction work and places tasks in the 
DatanodeDescriptor's queue, and also tracks them in PendingReconstruction.
* handleHeartbeat polls the DD queue and hands tasks to the DN on every 
heartbeat, capped at {{maxReplicationStreams - xmitsInProgress}} (see the 
sketch after this list).
* PendingReconstruction retries timed-out tasks after 5 minutes.
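
To make the NN-side math concrete, here's a rough sketch of the per-heartbeat 
handout (stand-in class and method names, not the actual 
FSNamesystem/DatanodeManager code; only the 
{{maxReplicationStreams - xmitsInProgress}} bound mirrors the real logic):

{code}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified stand-in for the NN handing reconstruction work to one DN.
class HeartbeatHandoutSketch {
  static List<String> pollReconstructionWork(Queue<String> dnQueue,
      int maxReplicationStreams, int xmitsInProgress) {
    // Same bound the NN uses per heartbeat.
    int maxTransfer = maxReplicationStreams - xmitsInProgress;
    List<String> work = new ArrayList<>();
    for (int i = 0; i < maxTransfer && !dnQueue.isEmpty(); i++) {
      work.add(dnQueue.poll());     // tasks previously queued by BlockManager
    }
    return work;                    // shipped back in the heartbeat response
  }

  public static void main(String[] args) {
    Queue<String> queue = new ArrayDeque<>();
    for (int i = 0; i < 30; i++) {
      queue.add("task-" + i);
    }
    // With maxReplicationStreams=20 and only 2 xmits counted, 18 tasks go out.
    System.out.println(pollReconstructionWork(queue, 20, 2).size());
  }
}
{code}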

The core of the current issue is that the DN refuses work from the NN because 
the assigned work exceeds numReconstructionThreads, while the NN keeps 
assigning more work because it thinks the DN still has spare xmit capacity.
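
For anyone who wants to see the rejection behavior in isolation, here is a 
generic bounded-pool sketch (pool and queue sizes are illustrative, and this is 
not the actual ErasureCodingWorker code): once the core threads, the queue, and 
the max threads are all used up, further submissions are simply dropped.

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Generic illustration of a bounded pool dropping work; sizes are made up.
class BoundedPoolSketch {
  public static void main(String[] args) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 8,                                  // corePoolSize=2, maxPoolSize=8
        60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(8));         // bounded queue, illustrative size

    int accepted = 0;
    int dropped = 0;
    for (int i = 0; i < 18; i++) {             // ~18 tasks arriving in one heartbeat
      try {
        pool.execute(() -> {
          try {
            Thread.sleep(10_000);              // pretend reconstruction takes a while
          } catch (InterruptedException ignored) {
          }
        });
        accepted++;
      } catch (RejectedExecutionException e) {
        dropped++;                             // the DN simply loses this task
      }
    }
    System.out.println("accepted=" + accepted + ", dropped=" + dropped);
    pool.shutdownNow();
  }
}
{code}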

I think the basic fix here is to increment xmitsInProgress even for queued 
reconstruction work. This way {{maxReplicationStreams - xmitsInProgress}} will 
eventually be <= 0 and the NN will stop giving more work. Otherwise the queue 
will not converge.
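
Here's a minimal sketch of the accounting I have in mind (field and method 
names are invented, this is not a patch): count a task against xmitsInProgress 
as soon as it is queued and release it when the task finishes, so the value the 
DN reports in heartbeats keeps the NN from over-assigning.

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the accounting idea only; names are invented for illustration.
class XmitAccountingSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();

  void onReconstructionTaskQueued(int weight) {
    // Count the task as soon as it is accepted, not only when a thread runs it.
    xmitsInProgress.addAndGet(weight);
  }

  void onReconstructionTaskDone(int weight) {
    xmitsInProgress.addAndGet(-weight);
  }

  int getXmitsInProgress() {
    return xmitsInProgress.get();   // reported to the NN in each heartbeat
  }
}
{code}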

I also wonder about the relative weights of EC and replicated reconstruction. 
Twenty EC reconstruction tasks represent a different amount of work than twenty 
re-replication tasks. We should be counting each block reader in 
StripedReconstructor as its own xmit, e.g. an RS(10,4) recovery task would 
count as 10 xmits. I looked at this in HDFS-11023 and thought it was accounted 
for properly, but looking again, that's not the case.
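
Roughly, the weighting I'm thinking of looks like this (hypothetical helper, 
not how StripedReconstructor is actually structured): an EC task counts one 
xmit per source block it reads, while a plain re-replication counts as one.

{code}
// Illustrative weighting only; the helper and its signature are made up.
class XmitWeightSketch {
  /**
   * Weight of a reconstruction task in "xmits": one per block reader.
   * An RS(10,4) recovery reads from 10 source DNs, so it weighs 10;
   * a plain re-replication streams from a single source, so it weighs 1.
   */
  static int xmitWeight(boolean striped, int numSourceReaders) {
    return striped ? numSourceReaders : 1;
  }

  public static void main(String[] args) {
    System.out.println(xmitWeight(true, 10));   // RS(10,4) recovery -> 10 xmits
    System.out.println(xmitWeight(false, 1));   // re-replication    -> 1 xmit
  }
}
{code}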

More generally, I think HDFS-11023 is still worth revisiting. The NN throttles 
are coarse and only operate on the heartbeat interval. Ideally the DN would 
have byte-based throttles, similar to the balancer's bandwidth settings, which 
would be more user-friendly.
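
To illustrate what a byte-based DN-side throttle could look like, here's a 
generic pacing sketch (class, fields, and the bytes-per-second setting are all 
assumptions for illustration, analogous in spirit to the balancer's bandwidth 
limit, not an existing HDFS API):

{code}
// Generic byte-rate pacing sketch; everything here is illustrative.
class ByteThrottleSketch {
  private final long bytesPerSec;
  private long windowStart = System.currentTimeMillis();
  private long bytesInWindow = 0;

  ByteThrottleSketch(long bytesPerSec) {
    this.bytesPerSec = bytesPerSec;
  }

  /** Sleep as needed so cumulative sends stay under bytesPerSec. */
  synchronized void throttle(long numBytes) throws InterruptedException {
    bytesInWindow += numBytes;
    long elapsed = System.currentTimeMillis() - windowStart;
    // How long this much data "should" take at the configured rate, in ms.
    long expected = bytesInWindow * 1000 / bytesPerSec;
    if (expected > elapsed) {
      Thread.sleep(expected - elapsed);
    }
    // Roll the accounting window roughly every second.
    if (System.currentTimeMillis() - windowStart >= 1000) {
      windowStart = System.currentTimeMillis();
      bytesInWindow = 0;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ByteThrottleSketch throttler = new ByteThrottleSketch(1_000_000); // ~1 MB/s
    for (int i = 0; i < 5; i++) {
      throttler.throttle(500_000);  // each 500 KB "send" paces to ~0.5 s
      System.out.println("sent chunk " + i);
    }
  }
}
{code}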

> Mismatch between BlockManager#maxReplicatioStreams and 
> ErasureCodingWorker.stripedReconstructionPool pool size causes slow and burst 
> recovery. 
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12044
>                 URL: https://issues.apache.org/jira/browse/HDFS-12044
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding
>    Affects Versions: 3.0.0-alpha3
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-12044.00.patch, HDFS-12044.01.patch, 
> HDFS-12044.02.patch
>
>
> {{ErasureCodingWorker#stripedReconstructionPool}} defaults to 
> {{corePoolSize=2}} and {{maxPoolSize=8}}, and it rejects additional tasks 
> once its queue is full.
> The problem arises when {{BlockManager#maxReplicationStreams}} is larger than 
> {{ErasureCodingWorker#stripedReconstructionPool}}'s 
> {{corePoolSize}}/{{maxPoolSize}}, for example {{maxReplicationStreams=20}} 
> with {{corePoolSize=2, maxPoolSize=8}}. The NN sends up to {{maxTransfer}} 
> reconstruction tasks to the DN on each heartbeat, calculated in 
> {{FSNamesystem}}:
> {code}
> final int maxTransfer = blockManager.getMaxReplicationStreams() - 
> xmitsInProgress;
> {code}
> However, at any given time 
> {{ErasureCodingWorker#stripedReconstructionPool}} only accounts for 2 
> {{xmitsInProgress}}. So on each 3s heartbeat the NN sends about 
> {{20 - 2 = 18}} reconstruction tasks to the DN, and the DN throws most of 
> them away if there are already 8 tasks in its queue. The NN then has to wait 
> longer to re-detect these blocks as under-replicated and schedule new tasks.


