[ 
http://issues.apache.org/jira/browse/HADOOP-90?page=comments#action_12413533 ] 

Doug Cutting commented on HADOOP-90:
------------------------------------

The low-tech thing I've done that's saved me when the namenode dies is simply 
to have a cron entry that rsyncs the namenode's files to another node every few 
minutes.  If the namenode dies, you lose at most a few minutes of computation.  
It's not perfect, and it's not a long-term solution, but it is a pretty 
effective workaround in my experience.
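
For concreteness, a minimal sketch of that kind of backup loop (the state
directory, the standby host and the interval below are made up, and a one-line
crontab entry invoking rsync does the same job):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Every few minutes, copy the name node's state directory to a second
    // machine so that at most a few minutes of edits are lost on failure.
    public class NameNodeBackup {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            Runnable sync = () -> {
                try {
                    // the paths and host are placeholders for whatever the cluster uses
                    new ProcessBuilder("rsync", "-a", "--delete",
                            "/tmp/hadoop/dfs/name/", "standby-host:/tmp/hadoop/dfs/name/")
                            .inheritIO().start().waitFor();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            };
            scheduler.scheduleAtFixedRate(sync, 0, 5, TimeUnit.MINUTES);
        }
    }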

> DFS is susceptible to data loss in case of name node failure
> ------------------------------------------------------------
>
>          Key: HADOOP-90
>          URL: http://issues.apache.org/jira/browse/HADOOP-90
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>     Versions: 0.1.0
>     Reporter: Yoram Arnon
>  Attachments: multipleEditsDest.patch
>
> Currently, the DFS name node stores its log and state in local files.
> This has the disadvantage that a hardware failure of the name node causes 
> total data loss. 
> Several approaches may be used to address this flaw:
> 1. replicate the name server state files using copy or rsync once in a while, 
> either manually or using a cron job.
> 2. set up secondary name servers and a protocol whereby the primary updates 
> the secondaries. In case of failure, a secondary can take over.
> 3. store the state files as distributed, replicated files in the DFS itself. 
> The difficulty is that it becomes a bootstrap problem, where the name node 
> needs some information, typically stored in its state files, in order to read 
> those same state files.
> Solution 1 is fine for non-critical systems, but it is insufficient for 
> systems that need to guarantee no data loss.
> Solutions 2 and 3 both seem valid; 3 seems more elegant in that it doesn't 
> require an extra protocol; it leverages the DFS itself and allows any level of 
> replication for robustness. Below is a proposal for solution 3.
> 1.    The name node, when it starts up, needs some basic information. That 
> information is not large and can easily be stored in a single block of DFS. 
> We hard-code the block location, using block id 0. Block 0 will contain the 
> list of blocks that contain the name node metadata - not the metadata itself 
> (file names, servers, blocks, etc.), just the list of blocks that contain it. 
> With block ids of 8 bytes, and 32 MB blocks, we can fit 256K block 
> ids in block 0. 256K blocks of 32 MB each can hold 8 TB of metadata, which can 
> map a large enough file system, so a single block of block ids is sufficient.
> 2.    The name node writes its state basically the same way as now: a log file 
> plus an occasional full state. DFS needs to change to commit changes to open 
> files while allowing continued writing to them, or else the log file would not 
> be valid if the name server failed before the file was closed. 
> 3.    The name node will use double buffering for its state, using blocks 0 
> and 1. Starting with block 0, it writes its state, then a log of changes. 
> When it's time to write a new state, it writes it to block 1. The state 
> includes a generation number, a single byte starting at 0, to enable the name 
> server to identify the valid state. A CRC is written at the end of the block 
> to mark its validity and completeness. The log file is identified by the same 
> generation number as the state it relates to (a rough sketch of this layout 
> follows the list below). 
> 4.    The log file will be limited to a single block as well. When that block 
> fills up, a new state is written. 32 MB of transaction logs should suffice. If 
> not, we could set aside a set of blocks and reserve a few locations in the 
> super-block (block 0/1) to store that set of block ids.
> 5.    The super-block, the log and the metadata blocks may be exposed as 
> reserved, read-only files in the DFS: /.metadata/* or something similar.
> 6.    When a name node starts, it waits for data nodes to connect to it and 
> report their blocks. It waits until it gets a report about blocks 0 and 1, 
> from which it can continue to read its entire state. After that, it continues 
> normally.
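>
> As a rough illustration of points 1, 3 and 6 (this is only a sketch, not 
> existing Hadoop code; the class and method names are made up), each 
> super-block copy in block 0 or 1 could be laid out and chosen like this:
>
>     import java.io.*;
>     import java.util.zip.CRC32;
>
>     // One copy of the super-block: a one-byte generation number, the ids of
>     // the blocks holding the name node metadata, and a trailing CRC that
>     // marks the copy as complete and valid.
>     class SuperBlock {
>         final byte generation;          // wraps around; the newer valid copy wins
>         final long[] metadataBlockIds;  // blocks containing the image and edit log
>
>         SuperBlock(byte generation, long[] metadataBlockIds) {
>             this.generation = generation;
>             this.metadataBlockIds = metadataBlockIds;
>         }
>
>         // Layout: generation, count, block ids, then a CRC32 over the preceding bytes.
>         byte[] serialize() throws IOException {
>             ByteArrayOutputStream buf = new ByteArrayOutputStream();
>             DataOutputStream out = new DataOutputStream(buf);
>             out.writeByte(generation);
>             out.writeInt(metadataBlockIds.length);
>             for (long id : metadataBlockIds) out.writeLong(id);
>             CRC32 crc = new CRC32();
>             crc.update(buf.toByteArray());
>             out.writeLong(crc.getValue());
>             return buf.toByteArray();
>         }
>
>         // Returns null when the trailing CRC does not match, i.e. the copy is torn or stale.
>         static SuperBlock deserialize(byte[] bytes) throws IOException {
>             CRC32 crc = new CRC32();
>             crc.update(bytes, 0, bytes.length - 8);
>             DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
>             byte gen = in.readByte();
>             long[] ids = new long[in.readInt()];
>             for (int i = 0; i < ids.length; i++) ids[i] = in.readLong();
>             if (in.readLong() != crc.getValue()) return null;
>             return new SuperBlock(gen, ids);
>         }
>
>         // Point 6: once blocks 0 and 1 have been reported, keep the valid copy with
>         // the newer generation (byte arithmetic handles wrap-around of the counter).
>         static SuperBlock chooseCurrent(SuperBlock copy0, SuperBlock copy1) {
>             if (copy0 == null) return copy1;
>             if (copy1 == null) return copy0;
>             return (byte) (copy0.generation - copy1.generation) > 0 ? copy0 : copy1;
>         }
>     }
>
> Because the new state always goes to the other block, and a copy is only 
> trusted when its trailing CRC matches, a crash in the middle of a write never 
> destroys the previously valid state.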

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira