[ 
https://issues.apache.org/jira/browse/HDFS-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848988#action_12848988
 ] 

Sanjay Radia commented on HDFS-1064:
------------------------------------

Some of the comments in the other Jiras have suggested that Yahoo has been 
working only on scalability and not on availability.  Both availability and 
scalability are important issues for us. Most folks equate availability with 
automatic failover, but there is more to availability than failover.  The 
original purpose of HDFS was to support a batch processing system. This 
allowed one to rely on restart, since batch jobs can be delayed. However, the 
SLA requirements for batch jobs are getting tighter. Further, Hadoop is 
beginning to be used for near-online or online services.

Below is some of the work that has gone into improving the availability of 
the NN and moving towards automatic failover. (Some of it is in release 20 
and the rest is in trunk.)

* We have made a lot of progress in restarting an HDFS cluster. Two years ago, 
the restart time for a 2K cluster at Yahoo was several hours; one had to start 
100 DNs at a time whenever the NN was rebooted. In trunk we have measured the 
restart time for a 3K cluster to be 30 minutes.  Reducing restart time is 
important for failover: cold/warm failover performs all or part of the restart. 
Some of the steps we took:
** reducing the time to load the fsImage and edit logs; you will see more of 
this in the next few months.
** reducing the cost of a block report. The initial block report is needed for 
the NN to start providing service.  Also, we can safely restart the NN and deal 
with 3K initial block reports in our clusters.
Facebook's internal patch puts block reports and heartbeats on a separate port; 
I understand that this has helped the startup time.
* A major step towards HA was adding the backup namenode, which synchronously 
receives the edit logs. This work needs to be extended to do an actual failover. 
We are exploring manual failover using this backup NN and, later, automatic 
failover using ZooKeeper.  There is also ongoing work on integrating 
BookKeeper with the NN. (I will explain the tradeoffs of the Backup NN vs. 
BookKeeper in a future comment.)
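
To illustrate why a backup node that receives edits synchronously shortens 
failover, here is a minimal sketch. It is not the HDFS implementation: the 
class names, the two operations, and the transaction-id check are all 
hypothetical and chosen only to show the idea that the primary acknowledges 
an edit only after the backup has applied it, so the backup already holds the 
current namespace when it has to take over.

```python
def _apply(namespace, op, path):
    """Apply one edit to an in-memory namespace (a set of paths)."""
    if op == "mkdir":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError("unknown op: %s" % op)


class BackupNode:
    """Hypothetical standby that mirrors the primary's namespace."""

    def __init__(self):
        self.namespace = set()
        self.last_txid = 0

    def apply_edit(self, txid, op, path):
        # Reject gaps so the backup never silently diverges from the primary.
        if txid != self.last_txid + 1:
            raise ValueError("expected txid %d, got %d"
                             % (self.last_txid + 1, txid))
        _apply(self.namespace, op, path)
        self.last_txid = txid


class PrimaryNode:
    """Hypothetical primary that streams every edit to the backup."""

    def __init__(self, backup):
        self.backup = backup
        self.namespace = set()
        self.last_txid = 0

    def log_edit(self, op, path):
        txid = self.last_txid + 1
        # Synchronous step: the backup applies the edit before the
        # primary applies it locally and acknowledges the caller.
        self.backup.apply_edit(txid, op, path)
        _apply(self.namespace, op, path)
        self.last_txid = txid
        return txid
```

In this sketch, a takeover after any acknowledged edit would find the 
backup's namespace and last transaction id equal to the primary's, which is 
the property that lets a failover skip reloading the image and replaying the 
edit logs.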



> NN Availability - umbrella Jira
> -------------------------------
>
>                 Key: HDFS-1064
>                 URL: https://issues.apache.org/jira/browse/HDFS-1064
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Sanjay Radia
>
> This is an umbrella jira for discussing availability of the HDFS NN and 
> providing references to other Jiras that improve its availability. This 
> includes, but is not limited to, automatic failover. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
