[
https://issues.apache.org/jira/browse/HDFS-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479315#comment-13479315
]
Kihwal Lee commented on HDFS-4075:
----------------------------------
On recommissioning, the dead nodes will not cause this overhead at that moment
(i.e. not in the same write lock block). They will cause their own share of the
logging storm when they rejoin and send in their full block reports, which would
block the namenode for 6-7 seconds in the above example. They will at least let
others run in between such block reports. Alternatively, the nodes can be brought
up in a controlled manner to reduce the impact, e.g. two datanode start-ups per
minute.
But the live nodes at the time of recommissioning can cause problems, unless
processing of potentially over-replicated blocks becomes asynchronous to
recommissioning and is also throttled. Doing invalidation inline but pausing and
releasing the lock won't be ideal, since it will prolong the execution of the
refreshNodes command. Delaying this work using the mis-replicated blocks
handling can make it asynchronous, but it cannot be throttled; at the next
block report, all will be processed.
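To make the "asynchronous and throttled" option concrete, here is a minimal
sketch. It is not the actual BlockManager code: the class
ThrottledExcessBlockProcessor, its methods, and the maxBlocksPerLockHold
parameter are all hypothetical, and a plain ReentrantReadWriteLock stands in
for the namesystem lock. The recommissioning path only enqueues block IDs, and
a background caller works through them in bounded batches so the write lock is
released between batches.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of throttled, asynchronous excess-block processing. */
class ThrottledExcessBlockProcessor {
    // Stand-in for the namesystem lock (assumption, not the real FSNamesystem lock).
    private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();
    // Block IDs waiting for an over-replication check.
    private final Queue<Long> pending = new ConcurrentLinkedQueue<>();
    // Upper bound on the number of blocks handled per write-lock acquisition.
    private final int maxBlocksPerLockHold;

    ThrottledExcessBlockProcessor(int maxBlocksPerLockHold) {
        if (maxBlocksPerLockHold <= 0) {
            throw new IllegalArgumentException("maxBlocksPerLockHold must be positive");
        }
        this.maxBlocksPerLockHold = maxBlocksPerLockHold;
    }

    /** Called from the recommissioning path: it only enqueues, so it returns quickly. */
    void enqueue(long blockId) {
        pending.add(blockId);
    }

    /**
     * Handles at most maxBlocksPerLockHold queued blocks under the write lock,
     * then releases the lock so that other operations can run.
     * Returns the number of blocks handled in this batch.
     */
    int processOneBatch() {
        namesystemLock.writeLock().lock();
        try {
            int handled = 0;
            while (handled < maxBlocksPerLockHold) {
                Long blockId = pending.poll();
                if (blockId == null) {
                    break; // queue drained
                }
                checkOverReplication(blockId);
                handled++;
            }
            return handled;
        } finally {
            namesystemLock.writeLock().unlock();
        }
    }

    /** Placeholder for the real per-block over-replication handling. */
    private void checkOverReplication(long blockId) {
        // Intentionally empty in this sketch.
    }

    int pendingCount() {
        return pending.size();
    }
}
```

A background thread would call processOneBatch() repeatedly with a pause
between calls; the batch size and the pause together bound how long the lock
is held at a time.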
I think the simplest remedy is to disable the state change logging for block
invalidation during recommissioning.
On a busy namenode, the overhead of logging every block state change may not be
negligible. We might want to add a capability to selectively disable certain
classes of state change logging. (There are already places that disable logging
for every block.)
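As an illustration of selectively disabling a class of state change logging,
here is a minimal sketch. It does not reflect any existing Hadoop API: the
class SelectiveStateChangeLog, the Category enum, and the sink parameter are
all hypothetical. Messages are passed as suppliers so that a disabled category
does not pay for building the message string.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch of per-category state change logging. */
class SelectiveStateChangeLog {
    // Illustrative categories only; not taken from the HDFS code base.
    enum Category { BLOCK_INVALIDATION, BLOCK_ADDITION, REPLICATION }

    private final Set<Category> disabled = EnumSet.noneOf(Category.class);
    // Destination for log lines, e.g. a real logger's info method.
    private final Consumer<String> sink;

    SelectiveStateChangeLog(Consumer<String> sink) {
        this.sink = sink;
    }

    synchronized void disable(Category category) {
        disabled.add(category);
    }

    synchronized void enable(Category category) {
        disabled.remove(category);
    }

    /** Emits the message unless its category is currently disabled. */
    void log(Category category, Supplier<String> message) {
        synchronized (this) {
            if (disabled.contains(category)) {
                return; // skipped before the message string is built
            }
        }
        sink.accept(message.get());
    }
}
```

With something like this, the recommissioning path could disable
BLOCK_INVALIDATION before processing excess blocks and re-enable it afterwards
in a finally block.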
> Reduce recommissioning overhead
> -------------------------------
>
> Key: HDFS-4075
> URL: https://issues.apache.org/jira/browse/HDFS-4075
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.23.4, 2.0.2-alpha
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
>
> When datanodes are recommissioned,
> {BlockManager#processOverReplicatedBlocksOnReCommission()} is called for each
> rejoined node and excess blocks are added to the invalidate list. The problem
> is this is done while the namesystem write lock is held.