[ http://issues.apache.org/jira/browse/HADOOP-90?page=all ]
Sameer Paranjpye reassigned HADOOP-90:
--------------------------------------
Assignee: Konstantin Shvachko
> DFS is susceptible to data loss in case of name node failure
> ------------------------------------------------------------
>
> Key: HADOOP-90
> URL: http://issues.apache.org/jira/browse/HADOOP-90
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.1.0
> Reporter: Yoram Arnon
> Assigned To: Konstantin Shvachko
> Fix For: 0.6.0
>
> Attachments: multipleEditsDest.patch
>
>
> Currently, the DFS name node stores its log and state in local files.
> This has the disadvantage that a hardware failure of the name node causes
> total data loss.
> Several approaches may be used to address this flaw:
> 1. replicate the name node state files periodically using copy or rsync,
> either manually or from a cron job.
> 2. set up secondary name servers and a protocol whereby the primary updates
> the secondaries. In case of failure, a secondary can take over.
> 3. store the state files as distributed, replicated files in the DFS itself.
> The difficulty is that it becomes a bootstrap problem, where the name node
> needs some information, typically stored in its state files, in order to read
> those same state files.
> Solution 1 is fine for non-critical systems, but it is insufficient for
> systems that must guarantee no data loss.
> Solutions 2 and 3 both seem valid; 3 seems more elegant in that it requires
> no extra protocol, leverages the DFS itself, and allows any level of
> replication for robustness. Below is a proposal for solution 3.
> 1. The name node, when it starts up, needs some basic information. That
> information is not large and can easily be stored in a single block of DFS.
> We hard-code the block location, using block id 0. Block 0 will contain the
> list of blocks that hold the name node metadata - not the metadata itself
> (file names, servers, blocks etc.), just the list of blocks that contain it.
> With a block id of 8 bytes and 32 MB blocks, we can fit 4M block ids in
> block 0. 4M blocks of 32 MB each can hold 128 TB of metadata, which can map
> a large enough file system, so a single block of block ids is sufficient
> (see the capacity sketch after this list).
> 2. The name node writes its state basically the same way as now: a log file
> plus an occasional full state image. DFS needs to change so that changes to
> an open file can be committed while writing to it continues; otherwise the
> log file would not be valid if the name node fails before the file is closed
> (a sketch of such a log writer appears after this list).
> 3. The name node will use double buffering for its state, using blocks 0
> and 1. Starting with block 0, it writes its state, then a log of changes.
> When it is time to write a new state it writes it to block 1. The state
> includes a generation number, a single byte starting at 0, to enable the
> name node to identify the valid state. A CRC is written at the end of the
> block to mark its validity and completeness. The log file is identified by
> the same generation number as the state it relates to (see the selection
> sketch after this list).
> 4. The log file will be limited to a single block as well. When that block
> fills up, a new state is written. 32 MB of transaction log should suffice;
> if not, we could set aside a set of blocks, and reserve a few locations in
> the super-block (block 0/1) to store that set of block ids (a roll-over
> sketch appears after this list).
> 5. The super-block, the log and the metadata blocks may be exposed as
> reserved, read-only files in the DFS: /.metadata/* or something similar.
> 6. When a name node starts, it waits for data nodes to connect and report
> their blocks. It waits until it receives reports for blocks 0 and 1, from
> which it can then read its entire state. After that it continues normally,
> as sketched below.
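>
> For point 1, a minimal sketch of the capacity arithmetic, assuming 32 MB
> blocks and 8-byte block ids as above (class and constant names are
> illustrative only, not existing Hadoop code):
>
>     public class SuperBlockCapacity {
>       // assumed sizes from the proposal above
>       static final long BLOCK_SIZE_BYTES = 32L * 1024 * 1024; // 32 MB block
>       static final long BLOCK_ID_BYTES = 8;                   // one block id
>
>       public static void main(String[] args) {
>         long idsPerBlock = BLOCK_SIZE_BYTES / BLOCK_ID_BYTES;  // ~4M ids
>         long metadataBytes = idsPerBlock * BLOCK_SIZE_BYTES;   // ~128 TB
>         System.out.println("ids per super-block: " + idsPerBlock);
>         System.out.println("addressable metadata (TB): "
>             + metadataBytes / (1L << 40));
>       }
>     }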
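>
> For point 2, a sketch of an edit-log writer, assuming DFS gains some
> flush/commit primitive for open files as described above; the class is
> hypothetical and only illustrates the write-then-commit pattern:
>
>     import java.io.DataOutputStream;
>     import java.io.IOException;
>     import java.io.OutputStream;
>
>     public class EditLogWriter {
>       private final DataOutputStream out;
>
>       public EditLogWriter(OutputStream dfsStream) {
>         this.out = new DataOutputStream(dfsStream);
>       }
>
>       // Append one mutation record and push it out before acknowledging
>       // the client, so the log is usable even if the name node dies before
>       // the file is ever closed.
>       public synchronized void logEdit(byte opCode, byte[] payload)
>           throws IOException {
>         out.writeByte(opCode);
>         out.writeInt(payload.length);
>         out.write(payload);
>         out.flush(); // must actually reach the data nodes, not just a buffer
>       }
>     }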
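>
> For point 3, a sketch of how a restarting name node could pick the valid
> super-block, assuming each candidate block starts with the one-byte
> generation number and ends with a CRC32 over the preceding bytes (the
> layout and names are assumptions for illustration):
>
>     import java.util.zip.CRC32;
>
>     public class SuperBlockSelector {
>
>       // The last 4 bytes of the block are assumed to hold the CRC32 of
>       // everything before them.
>       static boolean crcValid(byte[] block) {
>         int n = block.length - 4;
>         CRC32 crc = new CRC32();
>         crc.update(block, 0, n);
>         long stored = ((block[n] & 0xFFL) << 24)
>             | ((block[n + 1] & 0xFFL) << 16)
>             | ((block[n + 2] & 0xFFL) << 8)
>             | (block[n + 3] & 0xFFL);
>         return crc.getValue() == stored;
>       }
>
>       // Returns 0 or 1: the slot holding the newest complete state.
>       static int pickValid(byte[] block0, byte[] block1) {
>         boolean ok0 = crcValid(block0);
>         boolean ok1 = crcValid(block1);
>         if (ok0 && !ok1) return 0;
>         if (ok1 && !ok0) return 1;
>         // Both complete: the generation byte decides; comparing modulo 256
>         // handles wrap-around of the single-byte counter.
>         byte g0 = block0[0];
>         byte g1 = block1[0];
>         return ((byte) (g1 - g0)) > 0 ? 1 : 0;
>       }
>     }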
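>
> For point 4, a sketch of the roll-over rule: when the single log block is
> about to fill up, checkpoint the full state with a bumped generation number
> and start an empty log (names and thresholds are illustrative only):
>
>     public class LogRollPolicy {
>       static final long LOG_BLOCK_CAPACITY = 32L * 1024 * 1024; // one block
>
>       private long logBytesUsed = 0;
>       private int generation = 0; // the single-byte field from point 3
>
>       // Check before appending a record of the given size.
>       boolean needsCheckpoint(int recordSize) {
>         return logBytesUsed + recordSize > LOG_BLOCK_CAPACITY;
>       }
>
>       // Writing a checkpoint bumps the generation and resets the log.
>       void onCheckpointWritten() {
>         generation = (generation + 1) & 0xFF;
>         logBytesUsed = 0;
>       }
>
>       void onRecordAppended(int recordSize) {
>         logBytesUsed += recordSize;
>       }
>
>       // Alternate the state between the two super-block slots (0 and 1).
>       int currentSuperBlockSlot() {
>         return generation & 1;
>       }
>     }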
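>
> For point 6, a sketch of the bootstrap path, using a hypothetical
> BootstrapStore interface to stand in for the block reports arriving from
> data nodes (none of these names are real Hadoop APIs):
>
>     import java.io.IOException;
>
>     public class NameNodeBootstrap {
>
>       public interface BootstrapStore {
>         boolean isReported(long blockId);
>         byte[] readBlock(long blockId) throws IOException;
>       }
>
>       // Wait until both bootstrap blocks are reported, then pick the valid
>       // super-block (using the check sketched for point 3) and return it.
>       public static byte[] loadSuperBlock(BootstrapStore store)
>           throws IOException, InterruptedException {
>         while (!store.isReported(0L) || !store.isReported(1L)) {
>           Thread.sleep(1000); // keep waiting for data node block reports
>         }
>         byte[] b0 = store.readBlock(0L);
>         byte[] b1 = store.readBlock(1L);
>         int slot = SuperBlockSelector.pickValid(b0, b1);
>         return slot == 0 ? b0 : b1;
>       }
>     }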
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira