[
https://issues.apache.org/jira/browse/HDFS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159871#comment-13159871
]
Andrew Ryan commented on HDFS-2601:
-----------------------------------
Following on Todd's comment, it seems to me we now have proposals floating
around to store the image/edits data in one of three different locations:
shared storage (NFS), BookKeeper, and HDFS.
NFS is stable and mature, but good, reliable, redundant NFS hardware is
expensive and proprietary. The NFS filesystem has a lot of features which we
don't need for this use case. There are other shared storage filesystems out
there too, but as far as I know none of them are in wide use for storing
image/edits data, so I'm only mentioning NFS.
HDFS is stable, but not yet for this use case, and we'd need a lot of work to
get it there. And we'd still end up with a stack that, like NFS, is a gazillion
lines of code and handles a lot of different stuff, most of which we don't
need for the image/edits application.
BookKeeper doesn't exist yet, but I like that it's purpose-built to provide
just the minimal set of features needed to enable the image/edits scenario. So
hopefully it won't be a gazillion lines of code when it's done. But it will
require a lot of time to stabilize and prove itself.
NFS as shared storage for image/edits works today. It's the basis of our HA
namenode strategy at Facebook for the foreseeable future. It's hard to debate
the merits of HDFS vs. BK for image/edits storage, since neither exists yet;
we're comparing unicorns to leprechauns. My operational instincts and
experience running Hadoop tell me that either BK or a separate HDFS would be
best, but I'd need to better understand the operational characteristics of
each system, and these are not well-defined yet.
I'm looking forward to more discussion and hearing various viewpoints.
> Proposal to store edits and checkpoints inside HDFS itself for namenode HA
> --------------------------------------------------------------------------
>
> Key: HDFS-2601
> URL: https://issues.apache.org/jira/browse/HDFS-2601
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Reporter: Karthik Ranganathan
>
> Would have liked to make this a "brainstorming" JIRA but couldn't find the
> option for some reason.
> I have talked to quite a few people about this proposal internally at
> Facebook (HDFS folks like Hairong and Dhruba, as well as HBase folks
> interested in this feature), and wanted to broaden the audience.
> At the core of the HA feature, we need 2 things:
> A. the secondary NN (or avatar stand-by or whatever we call it) needs to read
> all the fsedits and fsimage data written by the primary NN
> B. Once the stand-by has taken over, the old NN should not be allowed to make
> any edits
> The basic idea is as follows (there are some variants; we can home in on the
> details if we like the general approach):
> 1. The write path for fsedits and fsimage:
> 1.1 The NN uses a DFS client to write fsedits and fsimage. These will be
> regular HDFS files written using the normal write pipeline.
> 1.2 Let us say the fsimage and fsedits files are written to a well-known
> location in the local HDFS itself (say /.META or some such location)
> 1.3 The file-create and add-block operations for files under this path are
> not themselves recorded in fsimage or fsedits. The locations of the blocks
> for the files under this path are known to all namenodes - primary and
> standby - somehow (some possibilities: write these block ids to ZK, use
> reserved block ids, or write some meta-data into the blocks themselves and
> store the blocks in a well-known location on all the datanodes)
> 1.4 If the replication factor in the write pipeline decreases, we close the
> block immediately and allow the NN to re-replicate to bring the replication
> factor back up, then continue writing to a new block. A sketch of this write
> path follows.
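> To make the write path concrete, here is a minimal sketch against the stock
> FileSystem API. The /.META/edits.inprogress name is hypothetical, and the
> suppression of journaling for this path (1.3) is not shown:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class EditsWriteSketch {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     // 1.2: well-known location inside the same HDFS (name is hypothetical).
>     Path edits = new Path("/.META/edits.inprogress");
>     // 1.1: a plain create() goes through the ordinary write pipeline.
>     FSDataOutputStream out = fs.create(edits, true);
>     out.writeLong(42L); // a serialized edit op would be written here
>     out.hflush();       // push the op to all datanodes in the pipeline
>     out.close();
>   }
> }
> {code}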
> 2. The read path on an NN failure
> 2.1 Since the new NN "knows" the locations of the blocks for the fsedits and
> fsimage (via the same possibilities mentioned above), there is nothing extra
> to do to determine them
> 2.2 It can read the files it needs using the HDFS client itself, as sketched
> below
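> Again purely illustrative, assuming the same hypothetical file name as
> above:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class EditsReadSketch {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     // 2.2: the standby replays the shared edits with an ordinary open().
>     try (FSDataInputStream in = fs.open(new Path("/.META/edits.inprogress"))) {
>       long firstOp = in.readLong(); // deserialize edit ops from here on
>       System.out.println("first op: " + firstOp);
>     }
>   }
> }
> {code}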
> 3. Fencing - if an NN is unresponsive and a new NN takes over, the old NN
> should not be allowed to perform any action
> 3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN
> will close all these files being written to by the old NN (and hence all the
> blocks)
> 3.2 The new NN (avatar NN) will write its address into ZK to let everyone
> know it's the master
> 3.3 The new NN now gets the lease for these files and starts writing into the
> fsedits and fsimage
> 3.4 The old NN cannot write into the file, as the block it was writing to
> was closed and it no longer holds the lease. If it needs to re-open these
> files, it must check ZK to see whether it is indeed the current master; if
> not, it should exit. A sketch of these fencing steps follows.
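> A rough sketch of 3.1-3.4. The /ha/active-nn znode path and the addresses
> are assumptions of the sketch (the znode is assumed to already exist), not
> part of the proposal:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
> import org.apache.zookeeper.ZooKeeper;
>
> public class FencingSketch {
>   public static void main(String[] args) throws Exception {
>     DistributedFileSystem dfs = (DistributedFileSystem)
>         new Path("/").getFileSystem(new Configuration());
>
>     // 3.1: the new NN forcibly closes the old writer's file, revoking its
>     // lease and finalizing the block it was writing.
>     dfs.recoverLease(new Path("/.META/edits.inprogress"));
>
>     // 3.2: publish ourselves as the active master in ZK.
>     ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
>     zk.setData("/ha/active-nn", "nn2.example.com:8020".getBytes(), -1);
>
>     // 3.4 (old NN side): before re-opening, verify we are still the
>     // registered master; exit otherwise.
>     byte[] master = zk.getData("/ha/active-nn", false, null);
>     if (!"nn2.example.com:8020".equals(new String(master))) {
>       System.exit(1);
>     }
>     zk.close();
>   }
> }
> {code}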
> 4. Misc considerations:
> 4.1 If needed, we can specify favored nodes to place the blocks for this
> data on a specific set of nodes (say we want to use a different set of
> RAIDed nodes, etc.).
> 4.2 Since we won't record the entries for /.META in fsedits and fsimage, a
> "hadoop dfs -ls /" command won't show the files. This is probably OK, and
> can be fixed if not.
> 4.3 With 256MB blocks, a 20GB fsedits file would need 80 block ids - the NN
> would need only these 80 block ids to read all the fsedits data. The fsimage
> data is even smaller. This is very tractable using a variety of techniques
> (the possibilities mentioned above).
> The advantage is that we re-use the existing HDFS client (with some
> enhancements, of course), making the solution self-contained within the
> existing HDFS. Also, the operational complexity is greatly reduced.
> Thoughts?