[ http://issues.apache.org/jira/browse/HADOOP-90?page=comments#action_12412591 ]
alan wootton commented on HADOOP-90:
------------------------------------
I thought about this for a while, and I think this is how I would attack the
problem.
First, I introduce the concept of a 'precious file' in dfs. A precious file is
a file that can be recovered even if all NameNode state is lost. This is
probably not a new object but just a boolean in FSDirectory.INode to indicate
that the file is precious (and a boolean in FileUnderConstruction).
A precious file is composed of PreciousBlock's, instead of Block's like a
normal dfs file:
public class PreciousBlock extends Block {
  UTF8 pathName;     // absolute dfs path
  long timestamp;    // milliseconds since last change
  int sequence;      // which block this is, i.e. 0th, 1st, etc.
  int total;         // count of blocks for the file
  short replication;

  public void write(DataOutput out) throws IOException {
    super.write(out);            // Block's own fields first
    pathName.write(out);
    out.writeLong(timestamp);
    out.writeInt(sequence);
    out.writeInt(total);
    out.writeShort(replication);
  }
  public void readFields(DataInput in) throws IOException { ... } // mirror of write()
}
Whenever a DataNode creates or replicates a block for a precious file, it also
serializes the PreciousBlock to a file in its data directory (or directories).
You would see something like this in a datanode directory:
blk_2241143806243395050
blk_385073589724654571
blk_8416156569406441156
precious_385073589724654571
This would mean that block #385073589724654571 is part of a precious file. The
file precious_385073589724654571 contains a serialization of PreciousBlock.
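A standalone sketch of that sidecar write (the `PreciousMeta` stand-in and its `save` helper are hypothetical names, not existing code; in a real patch this would live in the DataNode and reuse PreciousBlock itself):

```java
import java.io.*;

// Sketch: persist precious-file metadata as precious_<blockId> next to
// blk_<blockId> in a DataNode data directory. Stand-in for PreciousBlock.
class PreciousMeta {
    String pathName; long timestamp; int sequence; int total; short replication;

    void write(DataOutput out) throws IOException {
        out.writeUTF(pathName);
        out.writeLong(timestamp);
        out.writeInt(sequence);
        out.writeInt(total);
        out.writeShort(replication);
    }

    void readFields(DataInput in) throws IOException {
        pathName = in.readUTF();
        timestamp = in.readLong();
        sequence = in.readInt();
        total = in.readInt();
        replication = in.readShort();
    }

    // Written alongside the block file, so a block report can carry it back.
    void save(File dataDir, long blockId) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(new File(dataDir, "precious_" + blockId)))) {
            write(out);
        }
    }
}
```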
Files stored this way can be recovered without any NameNode state at all. The
NameNode could simply wait until the DataNodes report their blocks and
reconstruct all the precious files. If a precious file is widely replicated it
becomes almost impossible to ever lose. The purpose of the timestamp is for the
case where a file is renamed, or deleted and recreated, and a datanode didn't
get the message.
The next part is easy. The NameNode just writes its image and edits to precious
files in the dfs. The NameNode can easily keep several old versions of its
state and use other such tricks; old NameNode images could be rotated like log
files, for example.
Recovery from loss, or corruption, of NameNode state is relatively simple. The
NameNode recovers the precious files and then just reads the latest reliable
image/edits.
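To make the recovery step concrete, here is a rough sketch of how a NameNode could rebuild the precious namespace from block reports. Everything here (the `Reported` tuple, the map shapes) is hypothetical scaffolding for illustration, not existing code:

```java
import java.util.*;

// Sketch: rebuild precious files from (blockId + PreciousBlock metadata)
// pairs reported by DataNodes. For each dfs path, only blocks carrying the
// newest timestamp count (stale copies from renames or delete-and-recreate
// lose); the survivors are ordered by sequence number.
class PreciousRecovery {
    static class Reported {
        long blockId; String pathName; long timestamp; int sequence; int total;
        Reported(long id, String path, long ts, int seq, int total) {
            this.blockId = id; this.pathName = path; this.timestamp = ts;
            this.sequence = seq; this.total = total;
        }
    }

    // Returns path -> block ids in sequence order, dropping incomplete files.
    static Map<String, long[]> rebuild(List<Reported> reports) {
        // Newest timestamp wins per path: a datanode that missed a rename or
        // delete message reports an older timestamp and is ignored.
        Map<String, Long> newest = new HashMap<>();
        for (Reported r : reports)
            newest.merge(r.pathName, r.timestamp, Math::max);

        Map<String, long[]> files = new HashMap<>();
        Map<String, boolean[]> seen = new HashMap<>();
        for (Reported r : reports) {
            if (r.timestamp != newest.get(r.pathName)) continue; // stale copy
            long[] ids = files.computeIfAbsent(r.pathName, k -> new long[r.total]);
            boolean[] got = seen.computeIfAbsent(r.pathName, k -> new boolean[r.total]);
            ids[r.sequence] = r.blockId;
            got[r.sequence] = true;
        }
        // A precious file is only recoverable if every sequence number showed up.
        files.keySet().removeIf(path -> {
            for (boolean present : seen.get(path))
                if (!present) return true;
            return false;
        });
        return files;
    }
}
```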
> DFS is susceptible to data loss in case of name node failure
> ------------------------------------------------------------
>
> Key: HADOOP-90
> URL: http://issues.apache.org/jira/browse/HADOOP-90
> Project: Hadoop
> Type: Bug
> Components: dfs
> Versions: 0.1.0
> Reporter: Yoram Arnon
>
> Currently, the DFS name node stores its log and state in local files.
> This has the disadvantage that a hardware failure of the name node causes
> total data loss.
> Several approaches may be used to address this flaw:
> 1. replicate the name server state files using copy or rsync once in a while,
> either manually or using a cron job.
> 2. set up secondary name servers and a protocol whereby the primary updates
> the secondaries. In case of failure, a secondary can take over.
> 3. store the state files as distributed, replicated files in the DFS itself.
> The difficulty is that it becomes a bootstrap problem, where the name node
> needs some information, typically stored in its state files, in order to read
> those same state files.
> Solution 1 is fine for non-critical systems, but for systems that need to
> guarantee no data loss it's insufficient.
> Solutions 2 and 3 both seem valid; 3 seems more elegant in that it doesn't
> require an extra protocol, it leverages the DFS and allows any level of
> replication for robustness. Below is a proposition for solution 3.
> 1. The name node, when it starts up, needs some basic information. That
> information is not large and can easily be stored in a single block of DFS.
> We hard code the block location, using block id 0. Block 0 will contain the
> list of blocks that contain the name node metadata - not the metadata itself
> (file names, servers, blocks etc), just the list of blocks that contain it.
> With a block identified by 8 bytes, and 32 MB blocks, we can fit 4M block
> id's in block 0. 4M blocks of 32MB each can hold 128TB of metadata, which can
> map a large enough file system, so a single block of block_ids is sufficient.
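A quick sanity check of that super-block capacity, as plain arithmetic (assuming 8-byte block ids and 32 MB dfs blocks, as above):

```java
// Capacity check for the proposed super-block (block 0):
// how many 8-byte block ids fit in one 32 MB block, and how much
// metadata those ids can address at 32 MB per block.
public class SuperblockCapacity {
    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;          // 32 MB dfs block
        long idSize = 8;                             // bytes per block id
        long idsPerBlock = blockSize / idSize;       // ids fitting in block 0
        long addressable = idsPerBlock * blockSize;  // metadata bytes mappable
        System.out.println(idsPerBlock);             // 4194304 (~4M ids)
        System.out.println(addressable / (1L << 40)); // 128 (TB)
    }
}
```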
> 2. The name node writes its state basically the same way as now: a log file
> plus an occasional full state. DFS needs to change to commit changes to open
> files while allowing continued writing to them; otherwise the log file would
> not be valid if the name server failed before the file was closed.
> 3. The name node will use double buffering for its state, using blocks 0
> and 1. Starting with block 0, it writes its state, then a log of changes.
> When it's time to write a new state, it writes it to block 1. The state
> includes a generation number, a single byte starting at 0, to enable the name
> server to identify the valid state. A CRC is written at the end of the block
> to mark its validity and completeness. The log file is identified by the same
> generation number as the state it relates to.
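A sketch of how the name node might pick the valid super-block at startup under that scheme. The CRC choice (`java.util.zip.CRC32`) and the exact layout are assumptions for illustration, not part of the proposal:

```java
import java.util.zip.CRC32;

// Sketch: choose between the two super-block candidates (blocks 0 and 1)
// using the trailing CRC for validity and the wrapping one-byte generation
// number for recency, per the double-buffering proposal.
class SuperblockSelect {
    // A candidate super-block: payload bytes, stored CRC, generation byte.
    static class Candidate {
        byte[] payload; long storedCrc; int generation;
        Candidate(byte[] p, long crc, int gen) {
            payload = p; storedCrc = crc; generation = gen;
        }
        boolean valid() {
            CRC32 c = new CRC32();
            c.update(payload);
            return c.getValue() == storedCrc;  // complete, uncorrupted write
        }
    }

    // Returns 0 or 1 for the super-block to load, or -1 if neither is usable.
    static int choose(Candidate a, Candidate b) {
        boolean va = a.valid(), vb = b.valid();
        if (va && !vb) return 0;
        if (vb && !va) return 1;
        if (!va) return -1;
        // Both valid: newer generation wins. The byte wraps mod 256, so
        // compare the signed difference rather than the raw values.
        int diff = (byte) (a.generation - b.generation);
        return diff >= 0 ? 0 : 1;
    }
}
```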
> 4. The log file will be limited to a single block as well. When that block
> fills up, a new state is written. 32MB of transaction logs should suffice. If
> not, we could reserve a set of blocks, plus a few locations in the
> super-block (block 0/1) to store that set of block ids.
> 5. The super-block, the log and the metadata blocks may be exposed as read
> only files in reserved files in the DFS: /.metadata/* or something.
> 6. When a name node starts, it waits for data nodes to connect to it to
> report their blocks. It waits until it gets a report about blocks 0 and 1,
> from which it can continue to read its entire state. After that it continues
> normally.