[ http://issues.apache.org/jira/browse/HADOOP-90?page=comments#action_12413533 ]
Doug Cutting commented on HADOOP-90:
------------------------------------

The low-tech thing I've done that's saved me when the namenode dies is
simply to have a cron entry that rsyncs the namenode's files to another
node every few minutes. If the namenode dies, you lose at most a few
minutes of computation. It's not perfect, and it's not a long-term
solution, but it is a pretty effective workaround in my experience.
(A concrete crontab line is sketched at the end of this message.)

> DFS is susceptible to data loss in case of name node failure
> -------------------------------------------------------------
>
>          Key: HADOOP-90
>          URL: http://issues.apache.org/jira/browse/HADOOP-90
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>     Versions: 0.1.0
>     Reporter: Yoram Arnon
>  Attachments: multipleEditsDest.patch
>
> Currently, the DFS name node stores its log and state in local files.
> This has the disadvantage that a hardware failure of the name node
> causes total data loss.
>
> Several approaches may be used to address this flaw:
> 1. Replicate the name server state files using copy or rsync once in a
>    while, either manually or using a cron job.
> 2. Set up secondary name servers and a protocol whereby the primary
>    updates the secondaries. In case of failure, a secondary can take
>    over.
> 3. Store the state files as distributed, replicated files in the DFS
>    itself. The difficulty is that it becomes a bootstrap problem, where
>    the name node needs some information, typically stored in its state
>    files, in order to read those same state files.
>
> Solution 1 is fine for non-critical systems, but for systems that need
> to guarantee no data loss it's insufficient. Solutions 2 and 3 both
> seem valid; 3 seems more elegant in that it doesn't require an extra
> protocol; it leverages the DFS and allows any level of replication for
> robustness. Below is a proposal for solution 3.
>
> 1. The name node, when it starts up, needs some basic information.
>    That information is not large and can easily be stored in a single
>    block of DFS. We hard-code the block location, using block id 0.
>    Block 0 will contain the list of blocks that contain the name node
>    metadata - not the metadata itself (file names, servers, blocks,
>    etc.), just the list of blocks that contain it. With a block
>    identified by 8 bytes, and 32 MB blocks, we can fit 256K block ids
>    in block 0. 256K blocks of 32 MB each can hold 8 TB of metadata,
>    which can map a large enough file system, so a single block of
>    block ids is sufficient.
> 2. The name node writes its state basically the same way as now: a log
>    file plus an occasional full state. DFS needs to change to commit
>    changes to open files while allowing continued writing to them;
>    otherwise the log file would not be valid if the name server failed
>    before the file was closed.
> 3. The name node will use double buffering for its state, using blocks
>    0 and 1. Starting with block 0, it writes its state, then a log of
>    changes. When it's time to write a new state it writes it to block
>    1. The state includes a generation number, a single byte starting
>    at 0, to enable the name server to identify the valid state. A CRC
>    is written at the end of the block to mark its validity and
>    completeness. The log file is identified by the same generation
>    number as the state it relates to.
> 4. The log file will be limited to a single block as well. When that
>    block fills up, a new state is written. 32 MB of transaction logs
>    should suffice.
>    If not, we could set aside a set of blocks, and set aside a few
>    locations in the super-block (block 0/1) to store that set of
>    block ids.
> 5. The super-block, the log and the metadata blocks may be exposed as
>    read-only files under a reserved path in the DFS: /.metadata/* or
>    something.
> 6. When a name node starts, it waits for data nodes to connect to it
>    and report their blocks. It waits until it gets a report about
>    blocks 0 and 1, from which it can continue to read its entire
>    state. After that it continues normally.
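As a rough illustration of points 1 and 3 above, here is a minimal
sketch of one super-block image in plain Java: a generation byte, the
ids of the blocks holding the name node metadata, and a CRC32 trailer
that marks the block complete. Class and field layout are made up for
illustration and are not existing Hadoop code.

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    public class SuperBlock {
        static final int BLOCK_SIZE = 32 * 1024 * 1024;   // 32 MB DFS block

        /** Serialize one image into a full 32 MB block. */
        static byte[] serialize(int generation, long[] metadataBlockIds) {
            // A real implementation would bound metadataBlockIds.length
            // to what fits: (BLOCK_SIZE - 1 - 4 - 8) / 8 entries.
            ByteBuffer buf = ByteBuffer.allocate(BLOCK_SIZE);
            buf.put((byte) generation);            // single-byte generation number
            buf.putInt(metadataBlockIds.length);   // how many block ids follow
            for (long id : metadataBlockIds) {
                buf.putLong(id);                   // 8 bytes per block id
            }
            CRC32 crc = new CRC32();
            crc.update(buf.array(), 0, buf.position());
            buf.putLong(BLOCK_SIZE - 8, crc.getValue());  // trailer at a fixed offset
            return buf.array();
        }

        /** Double buffering: alternate between block slots 0 and 1. */
        static int slotFor(int generation) {
            return generation & 1;
        }
    }

Rolling the log (point 4) would fit the same pattern: when the log block
fills, write a new image with the generation bumped by one, then start a
fresh log carrying that generation.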
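On the reading side (points 3 and 6), the name node would keep whichever
of blocks 0 and 1 has an intact CRC trailer and, if both are valid,
prefer the newer generation byte. Again a sketch only; wrap-around of
the one-byte generation counter is glossed over here.

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    public class SuperBlockSelector {

        /** True if the CRC32 trailer matches the serialized payload. */
        static boolean isValid(byte[] block) {
            if (block.length < 1 + 4 + 8) {
                return false;                       // too short for header + trailer
            }
            ByteBuffer buf = ByteBuffer.wrap(block);
            int count = buf.getInt(1);              // id count follows the generation byte
            long payloadEnd = 1L + 4L + 8L * count; // generation + count + ids
            if (count < 0 || payloadEnd > block.length - 8) {
                return false;                       // corrupt or truncated length field
            }
            CRC32 crc = new CRC32();
            crc.update(block, 0, (int) payloadEnd);
            return crc.getValue() == buf.getLong(block.length - 8);
        }

        /** Pick the current image from the contents of blocks 0 and 1. */
        static byte[] chooseCurrent(byte[] block0, byte[] block1) {
            boolean ok0 = isValid(block0);
            boolean ok1 = isValid(block1);
            if (ok0 && !ok1) return block0;
            if (ok1 && !ok0) return block1;
            if (!ok0 && !ok1) throw new IllegalStateException("no valid super-block");
            int gen0 = block0[0] & 0xFF;            // unsigned generation bytes
            int gen1 = block1[0] & 0xFF;
            return gen0 >= gen1 ? block0 : block1;
        }
    }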
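Point 6 amounts to blocking start-up until data nodes have reported
block ids 0 and 1. A toy version of that wait; the report hook is
hypothetical and stands in for the real block-report protocol.

    import java.util.HashSet;
    import java.util.Set;

    public class SuperBlockWait {
        private final Set<Long> reported = new HashSet<Long>();

        /** Called once per block id received in a data node block report. */
        public synchronized void blockReported(long blockId) {
            reported.add(blockId);
            notifyAll();
        }

        /** Blocks until both block 0 and block 1 have been reported. */
        public synchronized void awaitSuperBlocks() throws InterruptedException {
            while (!reported.contains(0L) || !reported.contains(1L)) {
                wait();
            }
            // From here the name node can read blocks 0 and 1, pick the valid
            // image, fetch the metadata blocks it lists, and replay the log.
        }
    }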
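Returning to the workaround in the comment at the top: it amounts to a
single crontab entry along these lines. The interval, paths, and target
host are placeholders for whatever directory the namenode actually keeps
its image and edit log in.

    */5 * * * * rsync -a --delete /path/to/namenode/dir/ backup-host:/path/to/namenode/dir/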