Eric Newton created ACCUMULO-942:
------------------------------------
Summary: accumulo should be more resilient in the face of NN
failures
Key: ACCUMULO-942
URL: https://issues.apache.org/jira/browse/ACCUMULO-942
Project: Accumulo
Issue Type: Bug
Components: tserver
Reporter: Eric Newton
Assignee: Keith Turner
Priority: Critical
We experienced a NN failure on a large cluster. The edit log was written to a
RAIDed file system, but data sent to the edit log was still lost. We suspect
the drivers made promises they did not keep.
This left Accumulo in a slightly corrupt state: a few references to files that
were missing.
Also, we have attempted to have backup images of HDFS archived for disaster
recovery. This has not been helpful because Accumulo needs a highly consistent
set of metadata, and a slightly older version of the file system confuses it.
One defense is to use snapshots. However, these work at the table level, and
they are hard to coordinate with the HDFS snapshot.
Another approach is to leave a short history of the files in the !METADATA
table. The Google paper hints at keeping historical information:
{quote}
We also store secondary information in the METADATA table, including a log of
all events pertaining to each tablet (such as when a server begins serving
it). This information is helpful for debugging and performance analysis.
{quote}
I think it would also be helpful for disaster recovery. It may require the GC
to be more sensitive to historical information about compactions.
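A minimal sketch of the history idea, under stated assumptions: the class
TabletFileHistory, its methods, and the file names are all hypothetical and are
not part of Accumulo. It keeps the last few file sets a tablet has referenced
and only lets the GC delete a file once no retained generation mentions it, so
a slightly older view of the metadata would still point at files that exist.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: a bounded history of the file sets one tablet has referenced. */
class TabletFileHistory {
    private final int maxGenerations;
    // Newest generation first; each entry is the tablet's full file set at that point.
    private final Deque<Set<String>> generations = new ArrayDeque<>();

    TabletFileHistory(int maxGenerations) {
        if (maxGenerations < 1) {
            throw new IllegalArgumentException("maxGenerations must be at least 1");
        }
        this.maxGenerations = maxGenerations;
    }

    /** Record the tablet's file set, for example after a flush or compaction. */
    void record(Set<String> currentFiles) {
        generations.addFirst(new HashSet<>(currentFiles));
        // Drop the oldest generations once the history exceeds its bound.
        while (generations.size() > maxGenerations) {
            generations.removeLast();
        }
    }

    /** A file is a deletion candidate only if no retained generation references it. */
    boolean isDeletable(String file) {
        for (Set<String> generation : generations) {
            if (generation.contains(file)) {
                return false;
            }
        }
        return true;
    }
}
```

With a history of two generations, a file replaced by a compaction stays on
disk until two newer file sets have been recorded. The cost is extra disk
usage in proportion to the history length.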
Alternatively, we should start looking into high-availability NNs and
BookKeeper for high-performance logging.