[
https://issues.apache.org/jira/browse/HDFS-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070905#comment-16070905
]
Andrew Wang commented on HDFS-12044:
------------------------------------
Thanks Eddy, looks really good, a few comments:
* Can the LinkedBlockingDeque be a LinkedBlockingQueue? I don't think it needs
the Deque functionality (see the first sketch after this list).
* SRInfo#getWeight adds together the # of sources and # of targets. I think this
will overestimate the weight. Note that in StripedReader#readMinimumSources we
only read from minRequiredSources. We also don't read and write at the same time,
so it would be better to take max(minSources, targets) (see the second sketch
after this list).
* In ECWorker, xmits is incremented after submitting the task. Is there a
possible small race here? We could increment first to reserve capacity, then
try/catch to decrement if the submit fails (see the third sketch after this
list).
* The comment could be enhanced slightly, maybe:
{noformat}
// See HDFS-12044. We increase xmitsInProgress even if the task is only
// enqueued, so that
// 1) NN will not send more tasks than DN can execute and
// 2) DN will not throw away reconstruction tasks, and instead keeps an
// unbounded number of tasks in the executor's task queue.
{noformat}
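For the queue point, a minimal sketch of what I mean, assuming nothing
Deque-specific is needed; the class name and the 2/8 pool sizes here are just
for illustration (matching the defaults mentioned below), not the actual
ErasureCodingWorker code:
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: a plain LinkedBlockingQueue is enough as the executor's
// work queue when no Deque operations (addFirst/peekLast) are used.
public class QueueChoiceSketch {
  public static void main(String[] args) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 8, 60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>());   // plain FIFO is all we use
    pool.execute(() -> System.out.println("reconstruction task"));
    pool.shutdown();
  }
}
{code}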
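For the weight point, a rough sketch of the max-based calculation; the method
and parameter names are illustrative, not the actual SRInfo code:
{code}
// Illustrative sketch of the suggested weight calculation.
class ReconstructionWeightSketch {
  static int getWeight(int minRequiredSources, int numTargets) {
    // Reads (from minRequiredSources sources) and writes (to the targets) do
    // not happen at the same time, so the xmit cost is bounded by whichever
    // side is larger, not by their sum.
    return Math.max(minRequiredSources, numTargets);
  }
}
{code}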
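For the xmits race, a rough sketch of the reserve-then-submit pattern; the
class, field and method names are illustrative, not the exact DataNode/ECWorker
API:
{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: reserve xmit capacity before submitting the task.
class XmitReservationSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();
  private final ThreadPoolExecutor stripedReconstructionPool;

  XmitReservationSketch(ThreadPoolExecutor pool) {
    this.stripedReconstructionPool = pool;
  }

  void submitTask(Runnable task, int xmitWeight) {
    // Reserve capacity first, so the count never lags behind what is queued.
    xmitsInProgress.addAndGet(xmitWeight);
    try {
      stripedReconstructionPool.submit(task);
    } catch (RejectedExecutionException e) {
      // Undo the reservation if the task never made it into the queue.
      xmitsInProgress.addAndGet(-xmitWeight);
      throw e;
    }
  }
}
{code}
Undoing the reservation on rejection keeps the count exact even when the
executor is saturated.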
I also had another question about accounting. The NN accounts for DN xceiver
load when doing block placement, but the xmit count is not factored in. The
source and target DNs will each use an xceiver to send or receive the block,
but the DN running the reconstruction task doesn't (AFAICT). Should we twiddle
the xceiver count (or use an xceiver?) to influence BPP?
As an aside, I noticed what looks like an existing bug: DataNode#transferBlock
does not create its Daemon in the xceiver thread group (which is how we
currently count the # of xceivers). BlockRecoveryWorker#recoverBlocks is an
example of something not in DataTransferProtocol that still counts against this
thread group.
Unit tests:
* Could you add a unit test with two node failures, for some additional
coverage? IIUC a single reconstruction task will recover all the missing blocks
for an EC group, so it would be good to validate that.
* It would also be good to run some reconstruction tasks and validate at the
end that xmitsInProgress for all DNs goes back to zero (rough sketch after this
list).
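A rough sketch of that drain check, assuming MiniDFSCluster and
DataNode#getXmitsInProgress; it only shows the validation helper, not the full
EC reconstruction test setup:
{code}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

// Illustrative sketch of the post-reconstruction validation helper.
class XmitsDrainCheckSketch {
  // Poll until every DN's xmitsInProgress drains back to zero, failing the
  // test if it does not happen within the timeout.
  static void assertXmitsReturnToZero(MiniDFSCluster cluster, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      boolean allZero = true;
      for (DataNode dn : cluster.getDataNodes()) {
        if (dn.getXmitsInProgress() != 0) {
          allZero = false;
          break;
        }
      }
      if (allZero) {
        return;
      }
      Thread.sleep(100);
    }
    for (DataNode dn : cluster.getDataNodes()) {
      assertEquals("xmitsInProgress should drain to 0 on " + dn,
          0, dn.getXmitsInProgress());
    }
  }
}
{code}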
> Mismatch between BlockManager#maxReplicationStreams and
> ErasureCodingWorker.stripedReconstructionPool pool size causes slow and
> bursty recovery
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12044
> URL: https://issues.apache.org/jira/browse/HDFS-12044
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding
> Affects Versions: 3.0.0-alpha3
> Reporter: Lei (Eddy) Xu
> Assignee: Lei (Eddy) Xu
> Labels: hdfs-ec-3.0-must-do
> Attachments: HDFS-12044.00.patch, HDFS-12044.01.patch,
> HDFS-12044.02.patch, HDFS-12044.03.patch
>
>
> {{ErasureCodingWorker#stripedReconstructionPool}} defaults to {{corePoolSize=2}}
> and {{maxPoolSize=8}}, and it rejects additional tasks once its queue is full.
> The problem occurs when {{BlockManager#maxReplicationStreams}} is larger than
> {{ErasureCodingWorker#stripedReconstructionPool}}'s {{corePoolSize}}/{{maxPoolSize}},
> for example {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}.
> The NN sends up to {{maxTransfer}} reconstruction tasks to a DN on each
> heartbeat, calculated in {{FSNamesystem}}:
> {code}
> final int maxTransfer = blockManager.getMaxReplicationStreams() -
> xmitsInProgress;
> {code}
> However, at any given time {{ErasureCodingWorker#stripedReconstructionPool}}
> accounts for only 2 {{xmitsInProgress}}. So on each 3s heartbeat the NN sends
> about {{20 - 2 = 18}} reconstruction tasks to the DN, and the DN throws most of
> them away if there are already 8 tasks in the queue. The NN then needs longer
> to re-detect that these blocks are still under-replicated and schedule new
> tasks for them.