Eric Newton created ACCUMULO-942:
------------------------------------

             Summary: accumulo should be more resilient in the face of NN failures
                 Key: ACCUMULO-942
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-942
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
            Reporter: Eric Newton
            Assignee: Keith Turner
            Priority: Critical


We experienced an NN failure on a large cluster.  The edit log was written to a 
RAIDed file system, but data sent to the edit log was still lost.  We suspect 
the drivers made promises they did not keep.

This left Accumulo in a slightly corrupt state: a few references to files that 
were missing.
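
As a rough illustration of that state, the check below models the metadata's file references and the file system listing as plain sets and reports the references that have no matching file.  The function name and the sample paths are hypothetical and are not Accumulo APIs; this is a sketch of the idea only.

```python
def find_dangling_refs(metadata_refs, existing_files):
    """Return metadata file references that have no matching file, sorted."""
    return sorted(set(metadata_refs) - set(existing_files))


# Hypothetical example: the metadata refers to a file whose creation
# never reached the NN edit log, so the file system does not know about it.
metadata_refs = {"/t/1/F0000.rf", "/t/1/C0002.rf"}
existing_files = {"/t/1/F0000.rf", "/t/1/F0001.rf"}

dangling = find_dangling_refs(metadata_refs, existing_files)
```

Files that exist but are not referenced (such as F0001.rf here) are not a correctness problem in this model; only referenced files that are missing are.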

Also, we have tried archiving backup images of HDFS for disaster recovery.  
This has not been helpful because Accumulo needs a highly consistent set of 
metadata, and a slightly older version of the file system confuses it.

One defense is to use snapshots.  However, snapshots work at the table level, 
and they are hard to coordinate with the HDFS snapshot.

Another approach is to leave a short history of the files in the !METADATA 
table.  The Google paper hints at keeping historical information:

{quote}
We also store secondary information in the
METADATA table, including a log of all events per-
taining to each tablet (such as when a server begins
serving it). This information is helpful for debugging
and performance analysis.
{quote}

I think it would also be helpful for disaster recovery.  It may require the GC 
to be more sensitive to historical information about compactions.
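
A minimal sketch of the file-history idea, under stated assumptions: each tablet keeps its last few file sets, the GC refuses to delete any file named in that history, and recovery falls back to the newest recorded set whose files all still exist.  The class, the function, and the `depth` parameter are hypothetical names for illustration; none of this reflects the actual !METADATA schema or the existing GC.

```python
from collections import deque


class TabletFileHistory:
    """Keeps the last `depth` file sets recorded for one tablet (hypothetical)."""

    def __init__(self, depth=3):
        # Oldest entry first, newest last; older entries fall off automatically.
        self.versions = deque(maxlen=depth)

    def record(self, files):
        """Record the tablet's file set after a flush or compaction."""
        self.versions.append(frozenset(files))

    def protected_files(self):
        """Files a history-aware GC must not delete."""
        return set().union(*self.versions)

    def newest_recoverable(self, existing_files):
        """Newest recorded file set whose files all still exist, else None."""
        existing = set(existing_files)
        for files in reversed(self.versions):
            if files <= existing:
                return files
        return None


def gc_candidates(all_files, history):
    """Files the GC may delete: those not named anywhere in the history."""
    return set(all_files) - history.protected_files()


history = TabletFileHistory(depth=2)
history.record({"F0.rf", "F1.rf"})
history.record({"C2.rf"})  # a compaction replaced F0 and F1 with C2

# After the NN failure, C2 is gone but the compaction inputs survived.
recovered = history.newest_recoverable({"F0.rf", "F1.rf"})
deletable = gc_candidates({"F0.rf", "F1.rf", "C2.rf", "old.rf"}, history)
```

In this model the GC keeps F0 and F1 even though the current file set no longer uses them, which is what lets recovery fall back to them.  Falling back across a compaction is plausible because the output holds the same data as its inputs; a file produced by a flush would presumably still need the write-ahead log.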

Alternatively, we should start looking into high-availability NNs and 
BookKeeper for high-performance logging.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
