[jira] [Commented] (HDFS-11285) Dead DataNodes keep a long time in (Dead, DECOMMISSION_INPROGRESS), and never transition to (Dead, DECOMMISSIONED)

Lantao Jin (JIRA) Wed, 04 Jan 2017 19:59:27 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15800224#comment-15800224
 ]


Lantao Jin commented on HDFS-11285:
-----------------------------------

Thanks [~andrew.wang], I'm not sure whether our case is a common one. We 
have an upper-layer application that triggers and monitors the decommissioning 
progress. When it sees "Blocks with no live replicas" drop to 0 in the NN 
UI, it shuts down the DN. We do not wait for the transition to 
decommissioned because we have found that decommissioning sometimes takes a 
very long time while only one or two "Under replicated blocks" are left.

So, once "Blocks with no live replicas" reaches 0, the DN is shut down, and its 
status stays [Dead, Decommissioning] forever. Therefore, I need to run the 
four steps listed in the issue description to retire these nodes.

The relevant code is in HeartbeatManager and DecommissionManager:
{code}
if (!node.isDecommissionInProgress() && !node.isDecommissioned()) {
      // Update DN stats maintained by HeartbeatManager
      hbManager.startDecommission(node);
{code}
Only a node in [Dead, Normal] status can be set to [Dead, Decommissioned] 
directly; a node already in [Dead, Decommissioning] fails this check and is 
skipped.
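
To make the effect of that guard concrete, here is a minimal, self-contained sketch of the state transitions as I understand them. It is a toy model, not the actual Hadoop code: DecommissionSketch, Node, AdminState and the simplified refreshNodes are hypothetical names, and the assumption that leaving the exclude list resets a node to NORMAL is mine. It shows why a single refreshNodes leaves a dead, decommissioning node stuck, while the four-step workaround from the issue description gets it to DECOMMISSIONED.

```java
import java.util.Collections;
import java.util.Set;

class DecommissionSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    // Hypothetical stand-in for a DataNode descriptor; not the real Hadoop class.
    static class Node {
        final String host;
        final boolean alive;
        AdminState state;

        Node(String host, boolean alive, AdminState state) {
            this.host = host;
            this.alive = alive;
            this.state = state;
        }

        boolean isDecommissionInProgress() { return state == AdminState.DECOMMISSION_INPROGRESS; }
        boolean isDecommissioned() { return state == AdminState.DECOMMISSIONED; }
    }

    // Mirrors the guard quoted above: a node that is already in progress
    // (or already decommissioned) is skipped, even if it is dead.
    static void startDecommission(Node node) {
        if (!node.isDecommissionInProgress() && !node.isDecommissioned()) {
            // Assumption: a dead node is decommissioned immediately,
            // a live node starts the normal decommissioning process.
            node.state = node.alive
                    ? AdminState.DECOMMISSION_INPROGRESS
                    : AdminState.DECOMMISSIONED;
        }
    }

    // Assumption: removing a node from the exclude list resets it to NORMAL.
    static void stopDecommission(Node node) {
        if (node.isDecommissionInProgress() || node.isDecommissioned()) {
            node.state = AdminState.NORMAL;
        }
    }

    // Simplified model of refreshNodes for a single node and an exclude list.
    static void refreshNodes(Node node, Set<String> excludes) {
        if (excludes.contains(node.host)) {
            startDecommission(node);
        } else {
            stopDecommission(node);
        }
    }

    public static void main(String[] args) {
        Node dn = new Node("dn1", false, AdminState.DECOMMISSION_INPROGRESS);
        Set<String> excluded = Collections.singleton("dn1");
        Set<String> empty = Collections.emptySet();

        refreshNodes(dn, excluded);      // guard skips the node
        System.out.println(dn.state);    // prints DECOMMISSION_INPROGRESS

        refreshNodes(dn, empty);         // steps 1-2: remove from exclude, refresh
        refreshNodes(dn, excluded);      // steps 3-4: re-add to exclude, refresh
        System.out.println(dn.state);    // prints DECOMMISSIONED
    }
}
```

In this model the second refreshNodes only helps because the node first passes through NORMAL, which matches what we observe with the workaround.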

> Dead DataNodes keep a long time in (Dead, DECOMMISSION_INPROGRESS), and never 
> transition to (Dead, DECOMMISSIONED)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11285
>                 URL: https://issues.apache.org/jira/browse/HDFS-11285
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Lantao Jin
>
> We have seen the use case of decommissioning DataNodes that are already dead 
> or unresponsive, and not expected to rejoin the cluster. In a large cluster, 
> we found more than 100 nodes that were dead and decommissioning, with "Under 
> replicated blocks" and "Blocks with no live replicas" both at zero. This was 
> actually fixed in 
> [HDFS-7374|https://issues.apache.org/jira/browse/HDFS-7374]. After that fix, 
> running refreshNodes twice cleared this case. However, the fix seems to have 
> been lost in the refactor 
> [HDFS-7411|https://issues.apache.org/jira/browse/HDFS-7411]. We 
> are using a Hadoop version based on 2.7.1, and only the operations below 
> move the status from (Dead, DECOMMISSION_INPROGRESS) to 
> (Dead, DECOMMISSIONED):
> # Remove it from hdfs-exclude
> # refreshNodes
> # Re-add it to hdfs-exclude
> # refreshNodes
> So, why was this code removed from the new DecommissionManager in the refactor?
> {code:java}
> if (!node.isAlive) {
>   LOG.info("Dead node " + node + " is decommissioned immediately.");
>   node.setDecommissioned();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
