Aihua Xu created HDFS-16200:
-------------------------------

             Summary: Improve NameNode failover
                 Key: HDFS-16200
                 URL: https://issues.apache.org/jira/browse/HDFS-16200
             Project: Hadoop HDFS
          Issue Type: Task
          Components: namanode
    Affects Versions: 2.8.2
            Reporter: Aihua Xu
            Assignee: Aihua Xu


In a busy cluster, we are noticing that NameNode failover takes a long time 
(over 10 minutes), which causes cluster downtime for that period.

One bottleneck lies in resolving the client hosts' topology when the HDFS 
cluster is not colocated with the compute hosts. The NameNode resolves each 
client host's topology and uses it to sort the DataNodes that hold the 
requested blocks. The resolved topology is cached, so subsequent accesses are 
efficient. However, if the standby NameNode has been newly restarted, its 
cache is empty and all the client hosts (e.g., YARN hosts) need to be resolved 
again.
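
To illustrate why a newly restarted standby pays this cost, here is a minimal 
sketch of a cache placed in front of a slow host-to-rack resolver. It is not 
the NameNode's actual code: the class TopologyCacheSketch, the nested 
CachingResolver, and the stand-in resolver function are all hypothetical names 
chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class TopologyCacheSketch {

    /** Hypothetical cache in front of a slow host-to-rack resolver. */
    public static final class CachingResolver {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final Function<String, String> slowResolver;
        private final AtomicInteger slowCalls = new AtomicInteger();

        public CachingResolver(Function<String, String> slowResolver) {
            this.slowResolver = slowResolver;
        }

        /** Returns the rack for a host, calling the slow resolver only on a cache miss. */
        public String resolve(String host) {
            return cache.computeIfAbsent(host, h -> {
                slowCalls.incrementAndGet();
                return slowResolver.apply(h);
            });
        }

        /** Number of times the slow resolver was actually invoked. */
        public int slowCalls() {
            return slowCalls.get();
        }
    }

    public static void main(String[] args) {
        // Stand-in for an expensive lookup (assumption: a script or DNS call in practice).
        CachingResolver resolver = new CachingResolver(host -> "/default-rack");

        // Cold cache, as on a newly restarted standby: every distinct host is a miss.
        resolver.resolve("yarn-host-1");
        resolver.resolve("yarn-host-2");

        // Warm cache: repeated requests from the same hosts cost nothing extra.
        resolver.resolve("yarn-host-1");
        resolver.resolve("yarn-host-2");

        System.out.println("slow resolver calls: " + resolver.slowCalls());
    }
}
```

With thousands of distinct client hosts and an empty cache, every first 
request from each host lands on the slow path, which is the situation 
described above right after a failover.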

Possible solutions: 1) expose an API in DFSAdmin to load the topology cache, 
or 2) add a new configuration option to skip client topology resolution for 
non-colocated HDFS clusters. Since the client hosts and HDFS hosts are not 
colocated, sorting the DataNodes for the clients is unnecessary.
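
As a rough sketch of option 2, the flag below bypasses client topology 
resolution entirely and returns the DataNodes in their original order. 
Everything here is an assumption for illustration: the class 
SkipTopologySortSketch, the method orderDataNodes, and the boolean 
skipClientTopology are hypothetical and do not correspond to an existing HDFS 
API or configuration key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SkipTopologySortSketch {

    /**
     * Orders the DataNodes holding a block for a given client.
     * When skipClientTopology is true (hypothetical flag), the client's rack is
     * never resolved and the original order is returned unchanged.
     */
    public static List<String> orderDataNodes(
            List<String> dataNodes,
            Map<String, String> dataNodeRacks,
            String clientHost,
            Function<String, String> clientRackResolver,
            boolean skipClientTopology) {
        List<String> result = new ArrayList<>(dataNodes);
        if (skipClientTopology) {
            return result; // no resolver call, so no cold-cache penalty
        }
        String clientRack = clientRackResolver.apply(clientHost);
        // Same-rack DataNodes first; List.sort is stable, so ties keep their order.
        result.sort(Comparator.comparingInt(
                (String dn) -> clientRack.equals(dataNodeRacks.get(dn)) ? 0 : 1));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> racks = Map.of("dn1", "/rack1", "dn2", "/rack2");
        List<String> dataNodes = List.of("dn1", "dn2");

        // Non-colocated cluster: skip resolution and keep the given order.
        System.out.println(orderDataNodes(dataNodes, racks, "yarn-host-1",
                host -> "/rack2", true));

        // Colocated cluster: resolve the client and prefer its own rack.
        System.out.println(orderDataNodes(dataNodes, racks, "yarn-host-1",
                host -> "/rack2", false));
    }
}
```

The design choice is to make the skip path return before the resolver is ever 
invoked, so a cold cache on a freshly promoted NameNode costs nothing for 
clients that could not benefit from locality anyway.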


