Daryn Sharp created HDFS-12703:
----------------------------------

             Summary: Exceptions are fatal to decommissioning monitor
                 Key: HDFS-12703
                 URL: https://issues.apache.org/jira/browse/HDFS-12703
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.7.0
            Reporter: Daryn Sharp
            Priority: Critical


The {{DecommissionManager.Monitor}} runs as a scheduled executor task.  If the 
task throws an exception, all decommissioning ceases until the NN is restarted.  
Per the javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the 
task encounters an exception, subsequent executions are suppressed*.  The monitor 
thread is alive but blocked waiting for an executor task that will never come.  
The code currently discards the returned future, so the actual exception that 
aborted the task is lost.
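
The behavior described above can be reproduced outside HDFS with a minimal 
sketch.  It is not the {{DecommissionManager}} code: the class name, the 
{{failingMonitor}} stand-in, and the {{guarded}} wrapper are all hypothetical 
and only illustrate the {{scheduleAtFixedRate}} contract.  It assumes a monitor 
that throws on its second run, shows that keeping the returned future and 
calling {{get()}} surfaces the otherwise lost exception, and shows one possible 
mitigation, catching {{Throwable}} inside the task so the schedule survives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class MonitorSuppressionSketch {

    /** Hypothetical stand-in for the monitor: counts runs and throws on the second one. */
    static Runnable failingMonitor(AtomicInteger runs, CountDownLatch fiveRuns) {
        return () -> {
            int n = runs.incrementAndGet();
            fiveRuns.countDown();
            if (n == 2) {
                throw new IllegalStateException("simulated monitor failure");
            }
        };
    }

    /** Wraps a task so that no Throwable escapes to the executor (one possible mitigation). */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                System.err.println("monitor run failed, will run again: " + t);
            }
        };
    }

    /** Unguarded task: returns how many times it ran before the executor suppressed it. */
    static int runsBeforeSuppression() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        ScheduledFuture<?> future = executor.scheduleAtFixedRate(
            failingMonitor(runs, new CountDownLatch(5)), 0, 10, TimeUnit.MILLISECONDS);
        try {
            // A periodic future only completes if the task throws or is cancelled.
            future.get();
        } catch (ExecutionException e) {
            // Keeping the future is what makes the aborting exception visible.
            System.err.println("task aborted by: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
        return runs.get();
    }

    /** Guarded task: returns true if it reached five runs despite the failure on run two. */
    static boolean guardedKeepsRunning() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch fiveRuns = new CountDownLatch(5);
        executor.scheduleAtFixedRate(
            guarded(failingMonitor(new AtomicInteger(), fiveRuns)), 0, 10, TimeUnit.MILLISECONDS);
        try {
            return fiveRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs before suppression: " + runsBeforeSuppression());
        System.out.println("guarded task still running: " + guardedKeepsRunning());
    }
}
```

In this sketch the unguarded task stops after the run that throws, while the 
guarded task keeps being scheduled.  Whether the real monitor should swallow, 
log, or selectively rethrow exceptions is a design decision for the actual fix 
and is not implied here.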

Failover is insufficient since the task is also likely dead on the standby.  
Replication queue initialization after the transition to active will fix the 
under-replication of blocks on currently decommissioning nodes, but nodes 
submitted for decommissioning afterwards will never decommission.  The standby 
must be bounced prior to failover, and hopefully the error condition does not 
recur.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

