[
https://issues.apache.org/jira/browse/HDFS-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848988#action_12848988
]
Sanjay Radia commented on HDFS-1064:
------------------------------------
Some of the comments in the other Jiras have suggested that Yahoo has been
working on only scalability and not availability. Both availability and
scalability are important issues for us. Most folks equate availability with
automatic failover, but there is more to availability than failover. The
original purpose of HDFS was to support a batch processing system. This
allowed one to rely on restart, since batch jobs can be delayed. However, the
SLA requirements for batch jobs are getting tighter. Further, Hadoop is
beginning to be used for near-online or online services.
Below is some of the work that has been done to improve the availability of
the NN and to move towards automatic failover. (Some of this work is in
release 20 and some is in trunk.)
* We have made a lot of progress in restarting an HDFS cluster. Two years ago,
the restart time for a 2K-node cluster at Yahoo was several hours; one had to
start 100 DNs at a time whenever the NN was rebooted. In trunk we have measured
the restart time for a 3K-node cluster to be 30 minutes. Reducing restart time
is important for failover: cold/warm failover performs all or part of the
restart.
Some of the steps we took:
** reducing the time to load the fsImage and edit logs; you will see more of
this in the next few months.
** reducing the cost of a block report - the initial block report is needed
for the NN to start providing service. Also, we can safely restart the NN and
deal with 3K initial block reports in our clusters.
Facebook's internal patch puts block reports and heartbeats on a separate
port; I understand that this has helped the startup time.
* A major step towards HA was adding the backup namenode, which synchronously
receives the edit logs. This work needs to be extended to do an actual failover.
We are exploring manual failover using this backup NN and, later, automatic
failover using ZooKeeper. There is also ongoing work on integrating
BookKeeper with the NN. (I will explain the tradeoffs of the Backup NN vs.
BookKeeper in a future comment.)
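
To make the idea of a synchronously updated backup concrete, here is a minimal
sketch in Python. It is not the HDFS implementation: the class and method names
(BackupNode, ActiveNode, apply_edit, write) and the transaction-id ordering
check are assumptions made purely for illustration. It only shows why a backup
that receives every edit before the write is acknowledged can take over with
the same namespace state.

```python
# Hypothetical sketch: none of these names come from the HDFS code base.
from dataclasses import dataclass, field


@dataclass
class BackupNode:
    """Standby that keeps an up-to-date copy of the namespace in memory."""
    namespace: dict = field(default_factory=dict)
    last_txid: int = 0

    def apply_edit(self, txid: int, path: str, value: str) -> None:
        # Edits must arrive in order so the standby never has gaps.
        if txid != self.last_txid + 1:
            raise ValueError(f"expected txid {self.last_txid + 1}, got {txid}")
        self.namespace[path] = value
        self.last_txid = txid


@dataclass
class ActiveNode:
    """Primary that forwards each edit to the backup before applying it."""
    backup: BackupNode
    namespace: dict = field(default_factory=dict)
    last_txid: int = 0

    def write(self, path: str, value: str) -> int:
        txid = self.last_txid + 1
        # Synchronous step: the backup gets the edit first; the primary
        # applies it locally and acknowledges only after that succeeds.
        self.backup.apply_edit(txid, path, value)
        self.namespace[path] = value
        self.last_txid = txid
        return txid


backup = BackupNode()
active = ActiveNode(backup)
active.write("/user/a", "file-metadata-1")
active.write("/user/b", "file-metadata-2")

# At any point after an acknowledged write, the backup holds the same state,
# so a (manual) failover would not need to replay edits from disk.
assert backup.namespace == active.namespace
assert backup.last_txid == active.last_txid
```

In this sketch the cost of the synchronous design is visible in write(): every
write waits on the backup, which is the kind of tradeoff a comparison with a
shared log would need to weigh.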
> NN Availability - umbrella Jira
> -------------------------------
>
> Key: HDFS-1064
> URL: https://issues.apache.org/jira/browse/HDFS-1064
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Sanjay Radia
>
> This is an umbrella jira for discussing availability of the HDFS NN and
> providing references to other Jiras that improve its availability. This
> includes, but is not limited to, automatic failover.