[
https://issues.apache.org/jira/browse/HDFS-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070905#comment-16070905
]
Andrew Wang commented on HDFS-12044:
------------------------------------
Thanks Eddy, looks really good, a few comments:
* Can the LinkedBlockingDeque be a LinkedBlockingQueue? I don't think it needs
the Deque functionality (see the first sketch after this list).
* SRInfo#getWeight adds together the # of sources and # of targets. I think this
will overestimate the weight. Note that in StripedReader#readMinimumSources we
only read from minRequiredSources. We also don't read and write at the same time,
so it would be better to take max(minSources, targets) (see the second sketch
after this list).
* In ECWorker, xmits is incremented after submitting the task. Is there a
possible small race here? We could increment first to reserve capacity, then
try/catch to decrement if the submit fails (see the third sketch after this
list).
* The comment could be enhanced slightly, maybe:
{noformat}
// See HDFS-12044. We increase xmitsInProgress even if the task is only
// enqueued, so that
// 1) NN will not send more tasks than DN can execute and
// 2) DN will not throw away reconstruction tasks, and instead keeps an
// unbounded number of tasks in the executor's task queue.
{noformat}
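For the queue point, a minimal sketch of what I mean, assuming nothing
Deque-specific is needed; the class name and the 2/8 pool sizes here are just
for illustration (matching the defaults mentioned below), not the actual
ErasureCodingWorker code:
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: a plain LinkedBlockingQueue is enough as the executor's
// work queue when no Deque operations (addFirst/peekLast) are used.
public class QueueChoiceSketch {
  public static void main(String[] args) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 8, 60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>());   // plain FIFO is all we use
    pool.execute(() -> System.out.println("reconstruction task"));
    pool.shutdown();
  }
}
{code}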
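For the weight point, a rough sketch of the max-based calculation; the method
and parameter names are illustrative, not the actual SRInfo code:
{code}
// Illustrative sketch of the suggested weight calculation.
class ReconstructionWeightSketch {
  static int getWeight(int minRequiredSources, int numTargets) {
    // Reads (from minRequiredSources sources) and writes (to the targets) do
    // not happen at the same time, so the xmit cost is bounded by whichever
    // side is larger, not by their sum.
    return Math.max(minRequiredSources, numTargets);
  }
}
{code}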
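For the xmits race, a rough sketch of the reserve-then-submit pattern; the
class, field and method names are illustrative, not the exact DataNode/ECWorker
API:
{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: reserve xmit capacity before submitting the task.
class XmitReservationSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();
  private final ThreadPoolExecutor stripedReconstructionPool;

  XmitReservationSketch(ThreadPoolExecutor pool) {
    this.stripedReconstructionPool = pool;
  }

  void submitTask(Runnable task, int xmitWeight) {
    // Reserve capacity first, so the count never lags behind what is queued.
    xmitsInProgress.addAndGet(xmitWeight);
    try {
      stripedReconstructionPool.submit(task);
    } catch (RejectedExecutionException e) {
      // Undo the reservation if the task never made it into the queue.
      xmitsInProgress.addAndGet(-xmitWeight);
      throw e;
    }
  }
}
{code}
Undoing the reservation on rejection keeps the count exact even when the
executor is saturated.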
I also had another question about accounting. The NN accounts for DN xceiver
load when doing block placement, but the xmit count is not factored in. The
source and target DNs will each use an xceiver to send or receive the block,
but the DN running the reconstruction task doesn't (AFAICT). Should we twiddle
the xceiver count (or use an xceiver?) to influence BPP?
As an aside, I noticed what looks like an existing bug: DataNode#transferBlock
does not create its Daemon in the xceiver thread group (which is how we
currently count the # of xceivers). BlockRecoveryWorker#recoverBlocks is an
example of something not in DataTransferProtocol that still counts against this
thread group.
Unit tests:
* Could you add a unit test with two node failures, for some additional
coverage? IIUC a single reconstruction task will recover all the missing blocks
for an EC group, so it would be good to validate that.
* It would also be good to run some reconstruction tasks and validate at the
end that xmitsInProgress for all DNs goes back to zero (rough sketch after this
list).
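A rough sketch of that drain check, assuming MiniDFSCluster and
DataNode#getXmitsInProgress; it only shows the validation helper, not the full
EC reconstruction test setup:
{code}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

// Illustrative sketch of the post-reconstruction validation helper.
class XmitsDrainCheckSketch {
  // Poll until every DN's xmitsInProgress drains back to zero, failing the
  // test if it does not happen within the timeout.
  static void assertXmitsReturnToZero(MiniDFSCluster cluster, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      boolean allZero = true;
      for (DataNode dn : cluster.getDataNodes()) {
        if (dn.getXmitsInProgress() != 0) {
          allZero = false;
          break;
        }
      }
      if (allZero) {
        return;
      }
      Thread.sleep(100);
    }
    for (DataNode dn : cluster.getDataNodes()) {
      assertEquals("xmitsInProgress should drain to 0 on " + dn,
          0, dn.getXmitsInProgress());
    }
  }
}
{code}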
> Mismatch between BlockManager#maxReplicationStreams and
> ErasureCodingWorker.stripedReconstructionPool pool size causes slow and
> bursty recovery
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12044
> URL: https://issues.apache.org/jira/browse/HDFS-12044
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding
> Affects Versions: 3.0.0-alpha3
> Reporter: Lei (Eddy) Xu
> Assignee: Lei (Eddy) Xu
> Labels: hdfs-ec-3.0-must-do
> Attachments: HDFS-12044.00.patch, HDFS-12044.01.patch,
> HDFS-12044.02.patch, HDFS-12044.03.patch
>
>
> {{ErasureCodingWorker#stripedReconstructionPool}} defaults to {{corePoolSize=2}}
> and {{maxPoolSize=8}}, and it rejects additional tasks once its queue is full.
> The problem occurs when {{BlockManager#maxReplicationStreams}} is larger than
> {{ErasureCodingWorker#stripedReconstructionPool}}'s {{corePoolSize}}/{{maxPoolSize}},
> for example {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}.
> The NN sends up to {{maxTransfer}} reconstruction tasks to a DN on each
> heartbeat, calculated in {{FSNamesystem}}:
> {code}
> final int maxTransfer = blockManager.getMaxReplicationStreams() -
> xmitsInProgress;
> {code}
> However, at any given time {{ErasureCodingWorker#stripedReconstructionPool}}
> accounts for only 2 {{xmitsInProgress}}. So on each 3s heartbeat the NN sends
> about {{20 - 2 = 18}} reconstruction tasks to the DN, and the DN throws most of
> them away if there are already 8 tasks in the queue. The NN then needs longer
> to re-detect that these blocks are still under-replicated and schedule new
> tasks for them.