[jira] [Commented] (ACCUMULO-4092) metadata table corruption on recovery

Eric Newton (JIRA) Tue, 29 Dec 2015 08:18:07 -0800

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074041#comment-15074041
 ]


Eric Newton commented on ACCUMULO-4092:
---------------------------------------

I talked to [~kturner], and he pointed out one really good way to detect this 
early: use conditional mutations to verify the state of the metadata table 
before making updates.
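
A rough sketch of the idea, for illustration only. Accumulo 1.6 exposes 
conditional mutations through ConditionalWriter and ConditionalMutation, but 
the code below deliberately does not use that API. It is a small in-memory 
model of the check-then-write semantics, and the class name, method names and 
the "tablet1 file:F1" key layout are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** In-memory stand-in for a metadata table. Hypothetical; NOT the Accumulo API. */
class ConditionalUpdateSketch {
    enum Status { ACCEPTED, REJECTED }

    // "row column" key -> value, e.g. "tablet1 file:F1" -> "size=100"
    private final Map<String, String> cells = new HashMap<>();

    /** Unconditional write, used here only to seed the initial state. */
    synchronized void put(String key, String value) {
        cells.put(key, value);
    }

    synchronized String get(String key) {
        return cells.get(key);
    }

    /**
     * Writes newValue to key only if condKey currently holds expected.
     * An expected value of null means "condKey must be absent".
     * A newValue of null deletes key.
     */
    synchronized Status putIf(String condKey, String expected,
                              String key, String newValue) {
        if (!Objects.equals(cells.get(condKey), expected)) {
            // The table is not in the state the writer assumed:
            // reject the update instead of silently applying it.
            return Status.REJECTED;
        }
        if (newValue == null) {
            cells.remove(key);
        } else {
            cells.put(key, newValue);
        }
        return Status.ACCEPTED;
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch table = new ConditionalUpdateSketch();
        table.put("tablet1 file:F1", "size=100");

        // The writer's view matches the table, so the update goes through.
        Status ok = table.putIf("tablet1 file:F1", "size=100",
                                "tablet1 file:F2", "size=90");

        // The writer believes F9 exists, but the table has no such entry.
        Status stale = table.putIf("tablet1 file:F9", "size=5",
                                   "tablet1 file:F3", "size=1");

        System.out.println(ok + " " + stale);
    }
}
```

The point of the rejected case is that a mismatch between what the tablet 
server believes and what the metadata table holds shows up at the moment of 
the write, rather than months later at unload.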


> metadata table corruption on recovery
> -------------------------------------
>
>                 Key: ACCUMULO-4092
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4092
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.4
>         Environment: large production system, 1.6.2 with local patches, 
> hadoop 2.2
>            Reporter: Eric Newton
>
> I suspect that we are getting metadata table corruption on WAL recovery. 
> There have been several hints that this has occurred over the past 2 years, 
> but I have not had strong evidence for it until today.
> A large production cluster was recently upgraded to 1.6.4. Upon shutdown, it 
> had several consistency check failures.
> When a tablet is unloaded, it double-checks the entries for the tablet held 
> in memory against the metadata for the tablet. When the production system was 
> restarted for the upgrade, this check failed for several tablets. In 
> particular, the metadata held file references for the tablet that did not 
> exist in memory.
> This particular system has a very large table which is organized by date. 
> Almost all of the tablets that failed the check belong to the same date. If 
> the metadata tablet for those tablets was recovered on that date, and there 
> is some bug in recovering the WAL entries, it would have affected multiple 
> tablets for that same day.
> After searching around the logs, we did find that the metadata tablet for the 
> corrupt tablets did experience a recovery on the date in question.  
> Unfortunately, the WAL files were GC'd many weeks ago.
> We need more information to track down the bug. Some possible ways to get 
> this information include:
> 1) add periodic consistency checks: It's simple, and would detect problems 
> earlier. In a test environment, we might be able to keep all the archived 
> WALs.
> 2) upon metadata tablet recovery, the master could issue a request for 
> consistency checks for the affected tablets.  If checks fail, the recovery 
> logs could be archived.
> 3) add metadata splits to the long-running tests, which would add many more 
> metadata tablet recoveries.
> I suspect the bug is subtle, and may not cause data loss, since we don't see 
> data loss in continuous ingest tests.  But that doesn't mean that deleted 
> data isn't being returned to a table, since the CI test does not delete data.
> The uptime for this system is measured in months and includes several hundred 
> nodes. The metadata tablet is spread over most of the cluster.
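
The unload-time check described in the issue (comparing the file entries a 
tablet holds in memory against its metadata entries) amounts to a set 
difference in each direction. The following is a minimal sketch of that 
comparison, not the tserver's actual code; the class name, method names and 
file names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a tablet file consistency check; not Accumulo code. */
class TabletFileCheckSketch {

    /** Files referenced in the metadata table but unknown to the tablet in memory. */
    static Set<String> onlyInMetadata(Set<String> inMemory, Set<String> inMetadata) {
        Set<String> diff = new TreeSet<>(inMetadata);
        diff.removeAll(inMemory);
        return diff;
    }

    /** Files the tablet holds in memory that have no metadata entry. */
    static Set<String> onlyInMemory(Set<String> inMemory, Set<String> inMetadata) {
        Set<String> diff = new TreeSet<>(inMemory);
        diff.removeAll(inMetadata);
        return diff;
    }

    /** The check passes only when both views list exactly the same files. */
    static boolean consistent(Set<String> inMemory, Set<String> inMetadata) {
        return inMemory.equals(inMetadata);
    }

    public static void main(String[] args) {
        Set<String> inMemory = new TreeSet<>(Arrays.asList("F1.rf", "F2.rf"));
        Set<String> inMetadata = new TreeSet<>(Arrays.asList("F1.rf", "F2.rf", "F3.rf"));

        // This mirrors the failure reported in the issue: an extra file
        // reference in the metadata that the tablet does not know about.
        System.out.println("consistent: " + consistent(inMemory, inMetadata));
        System.out.println("only in metadata: " + onlyInMetadata(inMemory, inMetadata));
        System.out.println("only in memory: " + onlyInMemory(inMemory, inMetadata));
    }
}
```

Run periodically, or right after a metadata tablet recovery as proposed in 
options 1 and 2, a check of this kind would flag the mismatch while the WALs 
are still available.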



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
