[
https://issues.apache.org/jira/browse/HDFS-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215676#comment-17215676
]
Stephen O'Donnell commented on HDFS-15634:
------------------------------------------
The issue you have seen is a fairly extreme example - not too many clusters
will have 200 nodes decommissioned at the same time, I suspect. The empty node
problem is a valid concern on smaller clusters, and in the
decommission-recommission case the empty node may never catch up to the other
nodes, as the default block placement policy picks nodes randomly.
The long lock hold is a problem even when a node (or perhaps a whole rack) dies
unexpectedly. I think it would be better to try to fix that generally, rather
than doing something special for decommission. I am also not really comfortable
changing the current "non-destructive" decommission flow to something that
removes the blocks from the DN.
If you look at HeartbeatManager.heartbeatCheck(...), it seems to mark only one
DN as dead on each check interval. The check interval is either 5 minutes by
default, or 30 seconds if "dfs.namenode.avoid.write.stale.datanode" is true.
It then ultimately calls BlockManager.removeBlocksAssociatedTo(...), which does
the expensive work of removing the blocks. In that method, I wonder if we could
drop and re-take the write lock periodically so it does not hold the lock for
too long?
{code}
  /** Remove the blocks associated to the given datanode. */
  void removeBlocksAssociatedTo(final DatanodeDescriptor node) {
    providedStorageMap.removeDatanode(node);
    for (DatanodeStorageInfo storage : node.getStorageInfos()) {
      final Iterator<BlockInfo> it = storage.getBlockIterator();
      // Add the BlockInfos to a new collection as the
      // returned iterator is not modifiable.
      Collection<BlockInfo> toRemove = new ArrayList<>();
      while (it.hasNext()) {
        toRemove.add(it.next());
      }
      // Could we drop and re-take the write lock in this loop every 1000
      // blocks?
      for (BlockInfo b : toRemove) {
        removeStoredBlock(b, node);
      }
    }
    // Remove all pending DN messages referencing this DN.
    pendingDNMessages.removeAllMessagesForDatanode(node);
    node.resetBlocks();
    invalidateBlocks.remove(node);
  }
{code}
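To make that concrete, here is a rough sketch of what batching the lock might
look like. It is only an illustration, not a tested patch: it assumes the
caller holds the namesystem write lock on entry (as it does today), that the
namesystem field exposes the writeLock()/writeUnlock() pair, and that
removeStoredBlock tolerates other operations interleaving between batches. The
batch size of 1000 is just a placeholder, not a tuned constant.
{code}
  // Rough sketch only, not a tested patch. BLOCKS_PER_BATCH is an
  // arbitrary placeholder value.
  private static final int BLOCKS_PER_BATCH = 1000;

  private void removeStoredBlocksBatched(final DatanodeDescriptor node,
      final Collection<BlockInfo> toRemove) {
    int processed = 0;
    for (BlockInfo b : toRemove) {
      removeStoredBlock(b, node);
      if (++processed % BLOCKS_PER_BATCH == 0) {
        // Briefly drop and re-take the write lock so anything queued on it
        // (RPC handlers, other block ops) can make progress between batches.
        namesystem.writeUnlock();
        namesystem.writeLock();
      }
    }
  }
{code}
The inner loop in removeBlocksAssociatedTo would then call this helper instead
of removing the blocks inline.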
We sometimes see nodes with 5M, 10M or even more blocks, so this would help
those cases in general.
I am not sure if there would be any negative consequences of dropping and
retaking the write lock in this scenario?
> Invalidate block on decommissioning DataNode after replication
> --------------------------------------------------------------
>
> Key: HDFS-15634
> URL: https://issues.apache.org/jira/browse/HDFS-15634
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Reporter: Fengnan Li
> Assignee: Fengnan Li
> Priority: Major
> Labels: pull-request-available
> Attachments: write lock.png
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Right now when a DataNode starts decommissioning, the Namenode will mark it
> as decommissioning and its blocks will be replicated over to different
> DataNodes; the node is then marked as decommissioned. These blocks are not
> touched after that, since they are not counted as live replicas.
> Proposal: Invalidate these blocks once they are replicated and there are
> enough live replicas in the cluster.
> Reason: A recent shutdown of decommissioned datanodes to finish the flow
> caused a Namenode latency spike, since the Namenode needs to remove all of
> the blocks from its memory and this step requires holding the write lock. If
> we had gradually invalidated these blocks, the deletion would be much easier
> and faster.
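For reference, my reading of the proposal in the description is roughly the
sketch below. It is only an illustration: hasEnoughLiveReplicas is a
hypothetical stand-in for the real live-replica count check, and
addToInvalidates is the existing BlockManager routine that queues a replica
for deletion on a DN.
{code}
  // Hypothetical illustration of the proposal in the description, not the
  // actual patch. hasEnoughLiveReplicas is an assumed helper standing in
  // for the real replication check.
  void maybeInvalidateDecommissioningReplica(final BlockInfo block,
      final DatanodeDescriptor decommissioningNode) {
    if (hasEnoughLiveReplicas(block)) {
      // addToInvalidates queues the replica for deletion on the given DN,
      // so the final removal is spread out over time instead of happening
      // all at once when the node is declared dead.
      addToInvalidates(block, decommissioningNode);
    }
  }
{code}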