Proposal to build namenode HA inside HDFS itself
------------------------------------------------
Key: HDFS-2601
URL: https://issues.apache.org/jira/browse/HDFS-2601
Project: Hadoop HDFS
Issue Type: New Feature
Components: name-node
Reporter: Karthik Ranganathan
Would have liked to make this a "brainstorming" JIRA but couldn't find the
option for some reason.
I have talked to quite a few people about this proposal internally at Facebook
(HDFS folks like Hairong and Dhruba, as well as HBase folks interested in this
feature), and wanted to broaden the audience.
At the core of the HA feature, we need 2 things:
A. the secondary NN (or avatar stand-by or whatever we call it) needs to read
all the fsedits and fsimage data written by the primary NN
B. Once the stand-by has taken over, the old NN should not be allowed to make
any edits
The basic idea is as follows (there are some variants; we can home in on the
details if we like the general approach):
1. The write path for fsedits and fsimage:
1.1 The NN uses a DFS client to write fsedits and fsimage. These will be
regular HDFS files written through the normal write pipeline (a sketch of such
a writer follows this list).
1.2 Let us say the fsimage and fsedits files are written to a well-known
location in the local HDFS itself (say /.META or some such location)
1.3 The file creations and block additions under this path are not themselves
recorded in fsimage or fsedits. The locations of the blocks for the files in
this location are known to all namenodes - primary and standby - somehow (some
possibilities here: write these block ids to ZK, use reserved block ids, or
write some metadata into the blocks themselves and store the blocks in a
well-known location on all the datanodes)
1.4 If the replication factor in the write pipeline drops, we close the block
immediately and let the NN re-replicate it to restore the replication factor,
and we continue writing in a new block
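To make 1.1 concrete, here is a rough sketch (not a definitive implementation) of what the NN-side writer could look like if it simply uses the standard FileSystem client. The /.META/fsedits path and the EditLogWriter class name are assumptions, and it assumes hflush() is available so a standby can read the block still under construction:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EditLogWriter {
  // Well-known location from 1.2 (assumption).
  private static final Path EDITS_PATH = new Path("/.META/fsedits");

  private final FSDataOutputStream out;

  public EditLogWriter(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // A regular HDFS file, written through the ordinary write pipeline.
    // 'false' = fail rather than overwrite an existing edits file.
    this.out = fs.create(EDITS_PATH, false);
  }

  public void logEdit(byte[] serializedOp) throws IOException {
    out.write(serializedOp);
    // Push the bytes to all datanodes in the pipeline so a standby can read
    // them even while the last block is still under construction.
    out.hflush();
  }

  public void close() throws IOException {
    out.close();
  }
}
{code}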
2. The read path on a NN failure
2.1 Since the new NN "knows" the location of the blocks for fsedits and
fsimage (via the same possibilities mentioned above), there is nothing extra
to do to locate them
2.2 It can read the files it needs using the HDFS client itself
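Correspondingly, a minimal sketch of the read path in 2.2, again assuming the /.META/fsedits location from 1.2:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EditLogReader {
  // The standby reads the shared edits back through the ordinary HDFS client;
  // there is no special discovery step since the location is well known.
  public static FSDataInputStream openEdits(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.open(new Path("/.META/fsedits"));
  }
}
{code}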
3. Fencing - if a NN is unresponsive and a new NN takes over, the old NN
should not be allowed to perform any action
3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN
will close all of these files being written to by the old NN (and hence all
their blocks)
3.2 The new NN (avatar NN) will write its address into ZK to let everyone know
it is the master
3.3 The new NN now gets the lease for these files and starts writing into the
fsedits and fsimage
3.4 The old NN cannot write into the file because the block it was writing to
was closed and it no longer holds the lease. If it needs to re-open these
files, it must check ZK to see whether it is indeed the current master; if
not, it should exit.
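A sketch of how 3.1-3.4 could hang together, assuming a ZooKeeper znode (/hdfs-ha/active here, a made-up path) holds the active NN's address and that DistributedFileSystem.recoverLease() is what drives 3.1:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class FailoverFencer {
  private static final String ACTIVE_ZNODE = "/hdfs-ha/active";

  private final DistributedFileSystem dfs;
  private final ZooKeeper zk;

  public FailoverFencer(Configuration conf, ZooKeeper zk) throws IOException {
    this.dfs = (DistributedFileSystem) FileSystem.get(conf);
    this.zk = zk;
  }

  /** Called by the NN that is taking over. */
  public void fenceAndTakeOver(String myAddress) throws Exception {
    // 3.1: force lease recovery so the block the old NN was writing gets
    // closed and the old NN can no longer append to it.
    dfs.recoverLease(new Path("/.META/fsedits"));
    dfs.recoverLease(new Path("/.META/fsimage"));

    // 3.2: advertise this NN as the master; an ephemeral znode means the
    // claim disappears automatically if this NN dies.
    zk.create(ACTIVE_ZNODE, myAddress.getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // 3.3: from here on this NN re-opens the files and owns their leases.
  }

  /** 3.4: an NN checks ZK before re-opening the shared files; exits if it is no longer master. */
  public boolean amIStillMaster(String myAddress) throws Exception {
    byte[] data = zk.getData(ACTIVE_ZNODE, false, null);
    return myAddress.equals(new String(data, "UTF-8"));
  }
}
{code}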
4. Misc considerations:
4.1 If needed, we can specify favored nodes to place the blocks for this data
on a specific set of nodes (say we want to use a different set of RAIDed
nodes, etc.).
4.2 Since we won't record the entries for /.META in fsedits and fsimage, a
"hadoop dfs -ls /" command won't show the files. This is probably OK, and can
be fixed if not.
4.3 If we have 256MB block sizes, then a 20GB fsedits file would need 80 block
ids - the NN would need only these 80 block ids to read all the fsedits data.
The fsimage data is even smaller. This is very tractable using a variety of
techniques (the possibilities mentioned above); a rough sketch of the ZK
option follows.
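Here is one way the small list of block ids for the shared files could be published to ZooKeeper (the option from 1.3/2.1) so that any NN can find them without consulting fsimage/fsedits. The znode path and the comma-separated encoding are assumptions:

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedBlockRegistry {
  private static final String ZNODE = "/hdfs-ha/meta-block-ids";

  private final ZooKeeper zk;

  public SharedBlockRegistry(ZooKeeper zk) {
    this.zk = zk;
  }

  /** Active NN re-publishes the list whenever it starts a new block (1.4). */
  public void publish(long[] blockIds) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (long id : blockIds) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(id);
    }
    byte[] data = sb.toString().getBytes("UTF-8");
    if (zk.exists(ZNODE, false) == null) {
      zk.create(ZNODE, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(ZNODE, data, -1); // -1 = ignore the znode version
    }
  }

  /** Standby reads the ids back when it takes over (2.1). */
  public long[] read() throws Exception {
    String[] parts = new String(zk.getData(ZNODE, false, null), "UTF-8").split(",");
    long[] ids = new long[parts.length];
    for (int i = 0; i < parts.length; i++) {
      ids[i] = Long.parseLong(parts[i]);
    }
    return ids;
  }
}
{code}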
The advantage is that we re-use the existing HDFS client (with some
enhancements, of course) and make the solution self-sufficient on top of
existing HDFS. It also greatly reduces operational complexity.
Thoughts?