[
https://issues.apache.org/jira/browse/HDFS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159871#comment-13159871
]
Andrew Ryan commented on HDFS-2601:
-----------------------------------
Following on Todd's comment, it seems to me we now have proposals floating
around to store the image/edits data in one of three different locations:
shared storage (NFS), BookKeeper, and HDFS.
NFS is stable and mature, but good, reliable, redundant NFS hardware is
expensive and proprietary. The NFS filesystem has a lot of features which we
don't need for this use case. There are other shared storage filesystems out
there too, but as far as I know none of them are in wide use for storing
image/edits data, so I'm only mentioning NFS.
HDFS is stable, but not yet for this use case, and we'd need a lot of work to
get it there. And we'd still end up with a stack that, like NFS, is a gazillion
lines of code and handles a lot of different stuff, most of which we don't
need for the image/edits application.
BookKeeper doesn't exist yet, but I like that it's purpose-built to provide
just the minimal set of features needed to enable the image/edits scenario. So
hopefully it won't be a gazillion lines of code when it's done. But it will
require a lot of time to stabilize and prove itself.
NFS as shared storage for image/edits works today. It's the basis of our HA
namenode strategy at Facebook for the foreseeable future. It's hard to debate
the merits of HDFS vs. BK for image/edits storage, since neither exists yet;
we're comparing unicorns to leprechauns. My operational instincts and
experience running Hadoop tell me that either BK or a separate HDFS would be
best, but I'd need to better understand the operational characteristics of
each system, and these are not well-defined yet.
I'm looking forward to more discussion and hearing various viewpoints.
> Proposal to store edits and checkpoints inside HDFS itself for namenode HA
> --------------------------------------------------------------------------
>
> Key: HDFS-2601
> URL: https://issues.apache.org/jira/browse/HDFS-2601
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Reporter: Karthik Ranganathan
>
> Would have liked to make this a "brainstorming" JIRA but couldn't find the
> option for some reason.
> I have talked to quite a few people about this proposal internally at
> Facebook (HDFS folks like Hairong and Dhruba, as well as HBase folks
> interested in this feature), and wanted to broaden the audience.
> At the core of the HA feature, we need 2 things:
> A. the secondary NN (or avatar stand-by or whatever we call it) needs to read
> all the fsedits and fsimage data written by the primary NN
> B. Once the stand-by has taken over, the old NN should not be allowed to make
> any edits
> The basic idea is as follows (there are some variants; we can home in on the
> details if we like the general approach):
> 1. The write path for fsedits and fsimage:
> 1.1 The NN uses a DFS client to write fsedits and fsimage. These will be
> regular HDFS files written using the normal write pipeline.
> 1.2 Let us say the fsimage and fsedits files are written to a well-known
> location in the local HDFS itself (say /.META or some such location)
> 1.3 The file-create and add-block operations for files under this path are
> not themselves recorded in fsimage or fsedits. The locations of the blocks
> for the files under this path are known to all namenodes - primary and
> standby - somehow (some possibilities: write these block ids to ZK, use
> reserved block ids, or write some meta-data into the blocks themselves and
> store the blocks in a well-known location on all the datanodes)
> 1.4 If the replication factor in the write pipeline decreases, we close the
> block immediately and allow the NN to re-replicate to bring the replication
> factor back up, then continue writing to a new block. A sketch of this write
> path follows.
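> To make the write path concrete, here is a minimal sketch against the stock
> FileSystem API. The /.META/edits.inprogress name is hypothetical, and the
> suppression of journaling for this path (1.3) is not shown:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class EditsWriteSketch {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     // 1.2: well-known location inside the same HDFS (name is hypothetical).
>     Path edits = new Path("/.META/edits.inprogress");
>     // 1.1: a plain create() goes through the ordinary write pipeline.
>     FSDataOutputStream out = fs.create(edits, true);
>     out.writeLong(42L); // a serialized edit op would be written here
>     out.hflush();       // push the op to all datanodes in the pipeline
>     out.close();
>   }
> }
> {code}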
> 2. The read path on an NN failure
> 2.1 Since the new NN "knows" the locations of the blocks for the fsedits and
> fsimage (via the same possibilities mentioned above), there is nothing extra
> to do to determine them
> 2.2 It can read the files it needs using the HDFS client itself, as sketched
> below
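> Again purely illustrative, assuming the same hypothetical file name as
> above:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class EditsReadSketch {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     // 2.2: the standby replays the shared edits with an ordinary open().
>     try (FSDataInputStream in = fs.open(new Path("/.META/edits.inprogress"))) {
>       long firstOp = in.readLong(); // deserialize edit ops from here on
>       System.out.println("first op: " + firstOp);
>     }
>   }
> }
> {code}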
> 3. Fencing - if an NN is unresponsive and a new NN takes over, the old NN
> should not be allowed to perform any action
> 3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN
> will close all these files being written to by the old NN (and hence all the
> blocks)
> 3.2 The new NN (avatar NN) will write its address into ZK to let everyone
> know it's the master
> 3.3 The new NN now gets the lease for these files and starts writing into the
> fsedits and fsimage
> 3.4 The old NN cannot write into the file, as the block it was writing to
> was closed and it no longer holds the lease. If it needs to re-open these
> files, it must check ZK to see whether it is indeed the current master; if
> not, it should exit. A sketch of these fencing steps follows.
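> A rough sketch of 3.1-3.4. The /ha/active-nn znode path and the addresses
> are assumptions of the sketch (the znode is assumed to already exist), not
> part of the proposal:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
> import org.apache.zookeeper.ZooKeeper;
>
> public class FencingSketch {
>   public static void main(String[] args) throws Exception {
>     DistributedFileSystem dfs = (DistributedFileSystem)
>         new Path("/").getFileSystem(new Configuration());
>
>     // 3.1: the new NN forcibly closes the old writer's file, revoking its
>     // lease and finalizing the block it was writing.
>     dfs.recoverLease(new Path("/.META/edits.inprogress"));
>
>     // 3.2: publish ourselves as the active master in ZK.
>     ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
>     zk.setData("/ha/active-nn", "nn2.example.com:8020".getBytes(), -1);
>
>     // 3.4 (old NN side): before re-opening, verify we are still the
>     // registered master; exit otherwise.
>     byte[] master = zk.getData("/ha/active-nn", false, null);
>     if (!"nn2.example.com:8020".equals(new String(master))) {
>       System.exit(1);
>     }
>     zk.close();
>   }
> }
> {code}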
> 4. Misc considerations:
> 4.1 If needed, we can specify favored nodes to place the blocks for this
> data on a specific set of nodes (say we want to use a different set of
> RAIDed nodes, etc.).
> 4.2 Since we won't record the entries for /.META in fsedits and fsimage, a
> "hadoop dfs -ls /" command won't show the files. This is probably OK, and
> can be fixed if not.
> 4.3 With 256MB blocks, a 20GB fsedits file would need 80 block ids - the NN
> would need only these 80 block ids to read all the fsedits data. The fsimage
> data is even smaller. This is very tractable using a variety of techniques
> (the possibilities mentioned above).
> The advantage is that we re-use the existing HDFS client (with some
> enhancements, of course), making the solution self-contained within the
> existing HDFS. Also, the operational complexity is greatly reduced.
> Thoughts?