[
https://issues.apache.org/jira/browse/ACCUMULO-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074041#comment-15074041
]
Eric Newton commented on ACCUMULO-4092:
---------------------------------------
I was talking to [~kturner], and he pointed out one really good way to detect
this early: use conditional mutations to verify the state of the metadata
table before making updates.
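
A minimal sketch of the idea, in Java. Accumulo's client API provides
ConditionalWriter and ConditionalMutation for this; the class below is instead
a hypothetical in-memory stand-in for a single metadata row, so it runs
without a cluster. The names ConditionalUpdateSketch and updateIf, and the
column and value strings, are invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory stand-in for one metadata row; not Accumulo code.
class ConditionalUpdateSketch {
    private final Map<String, String> columns = new HashMap<>();

    // Applies `updates` only if `column` currently holds `expected`.
    // A null `expected` means "the column must be absent".
    // Returns false, and changes nothing, when the state does not match.
    synchronized boolean updateIf(String column, String expected,
                                  Map<String, String> updates) {
        if (!Objects.equals(columns.get(column), expected)) {
            return false; // the row is not in the state the writer assumed
        }
        columns.putAll(updates);
        return true;
    }

    synchronized String get(String column) {
        return columns.get(column);
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch row = new ConditionalUpdateSketch();
        String col = "file:/t-0001/F0000001.rf"; // made-up column name

        // The writer believes the tablet has no entry for this file yet.
        boolean first = row.updateIf(col, null,
                Collections.singletonMap(col, "1024,10"));

        // A second writer holding the same, now stale, belief is rejected.
        boolean second = row.updateIf(col, null,
                Collections.singletonMap(col, "2048,20"));

        System.out.println("first accepted: " + first
                + ", second accepted: " + second);
    }
}
```

A rejected update is the early signal: the writer's view of the metadata has
already diverged, and that can be logged while the WALs still exist.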
> metadata table corruption on recovery
> -------------------------------------
>
> Key: ACCUMULO-4092
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4092
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.6.4
> Environment: large production system, 1.6.2 with local patches,
> hadoop 2.2
> Reporter: Eric Newton
>
> I suspect that we are getting metadata table corruption on WAL recovery.
> There have been several hints that this has occurred over the past 2 years,
> but I have not had strong evidence for it until today.
> A large production cluster was recently upgraded to 1.6.4. Upon shutdown, it
> had several consistency check failures.
> When a tablet is unloaded, it double-checks the entries for the tablet held
> in memory against the metadata for the tablet. When the production system was
> restarted for the upgrade, this check failed for several tablets. In
> particular, there were file references in the metadata for the tablet that
> did not exist in memory.
> This particular system has a very large table which is organized by date.
> Almost all of the tablets that failed the check correspond to the same date.
> If the metadata tablet for those tablets was recovered on that date, and there
> is some bug in recovering the WAL entries, it would have affected multiple
> tablets on the same day.
> After searching around the logs, we did find that the metadata tablet for the
> corrupt tablets did experience a recovery on the date in question.
> Unfortunately, the WAL files were GC'd many weeks ago.
> We need more information to track down the bug. Some possible ways to get
> this information include:
> 1) add periodic consistency checks: It's simple, and would detect problems
> earlier. In a test environment, we might be able to keep all the archived
> WALs.
> 2) upon metadata tablet recovery, the master could issue a request for
> consistency checks for the affected tablets. If checks fail, the recovery
> logs could be archived.
> 3) add metadata splits to the long-running tests, which would cause many more
> metadata tablet recoveries
> I suspect the bug is subtle, and may not cause data loss, since we don't see
> data loss in continuous ingest tests. But that doesn't mean that deleted
> data isn't being returned to a table, since the CI test does not delete data.
> The uptime for this system is measured in months and includes several hundred
> nodes. The metadata tablet is spread over most of the cluster.
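
The unload-time check described in the issue, and the periodic check proposed
in option 1, both come down to comparing two sets of file references. A
minimal sketch in Java, assuming each side can be read as a set of file names;
TabletFileCheckSketch, its methods, and the file names are hypothetical and
are not taken from the Accumulo source.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical comparison of a tablet's in-memory file list with its
// metadata entries; not Accumulo code.
class TabletFileCheckSketch {

    // Returns the entries present in `a` but missing from `b`, sorted.
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    // The check passes only when both views list exactly the same files.
    static boolean consistent(Set<String> inMemory, Set<String> inMetadata) {
        return inMemory.equals(inMetadata);
    }

    public static void main(String[] args) {
        // Made-up file names for illustration.
        Set<String> inMemory = new TreeSet<>(
                Arrays.asList("F0000001.rf", "F0000002.rf"));
        Set<String> inMetadata = new TreeSet<>(
                Arrays.asList("F0000001.rf", "F0000002.rf", "F0000003.rf"));

        if (!consistent(inMemory, inMetadata)) {
            // The failure mode from the issue: references in the metadata
            // that the tablet does not hold in memory.
            System.out.println("only in metadata: "
                    + onlyIn(inMetadata, inMemory));
            System.out.println("only in memory: "
                    + onlyIn(inMemory, inMetadata));
        }
    }
}
```

Reporting both directions of the difference separates extra metadata
references from missing ones.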