Josh Elser created ACCUMULO-4851:
------------------------------------

             Summary: WAL recovery directory should be deleted before running LogSorter
                 Key: ACCUMULO-4851
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
            Reporter: Josh Elser
            Assignee: Josh Elser
             Fix For: 1.9.0


Noticed this one on a user's 1.7-ish system.

A number of tablets (~9) were unassigned and reported on the Monitor as having 
failed to load. Digging into the exception, we could see the tablet load failed 
due to a FileNotFoundException:
{noformat}
2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
        at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
        at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
        at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
        at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
        at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
        at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
        ... 9 more
Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
        at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
        at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
        at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
        at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
        at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
        at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
        at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
        ... 11 more
2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
{noformat}
Upon further investigation of the recovery directory in HDFS for this WAL, we 
find the following:
{noformat}
$ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
-rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
-rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
-rw-r--r--   3 accumulo hdfs    8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
-rw-r--r--   3 accumulo hdfs        642 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
-rw-r--r--   3 accumulo hdfs    8540196 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
-rw-r--r--   3 accumulo hdfs        524 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
-rw-r--r--   3 accumulo hdfs    8150879 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
-rw-r--r--   3 accumulo hdfs        584 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
-rw-r--r--   3 accumulo hdfs    8438021 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
-rw-r--r--   3 accumulo hdfs        630 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
-rw-r--r--   3 accumulo hdfs    4956770 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
-rw-r--r--   3 accumulo hdfs        408 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
{noformat}
The strange thing here is that we have both finished and failed markers for
this WAL's recovery directory. Given the timestamps, it appears that TServer1
tried to do recovery, failed for some reason, and then TServer2 came along and
successfully completed the LogSort.

However, when the merged read of the sorted files ran, it treated the failed
marker as a sorted chunk and tried to open failed/data, which does not exist.
That is the FileNotFoundException in the stack trace above.
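
To make the failure mode concrete, here is a minimal sketch that mimics the directory layout above on a local filesystem. It is not the Accumulo code: the class and method names are hypothetical, java.nio.file stands in for HDFS, and the rule "every child except the finished marker is a chunk directory containing a data file" is an assumption inferred from the stack trace.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical stand-in for the merged read; NOT Accumulo's MultiReader.
final class RecoveryDirScan {

  // Assumption: every child except the "finished" marker is treated as a
  // sorted chunk directory that must contain a "data" file.
  static List<Path> expectedDataFiles(Path recoveryDir) throws IOException {
    List<Path> result = new ArrayList<>();
    try (Stream<Path> children = Files.list(recoveryDir)) {
      children.sorted().forEach(child -> {
        if (!child.getFileName().toString().equals("finished")) {
          result.add(child.resolve("data"));
        }
      });
    }
    return result;
  }

  // Returns the first expected data file that does not exist, if any.
  static Optional<Path> firstMissingDataFile(Path recoveryDir) throws IOException {
    return expectedDataFiles(recoveryDir).stream()
        .filter(p -> !Files.exists(p))
        .findFirst();
  }

  public static void main(String[] args) throws IOException {
    // Rebuild a small version of the listing: both markers plus one chunk.
    Path dir = Files.createTempDirectory("recovery");
    Files.createFile(dir.resolve("failed"));
    Files.createFile(dir.resolve("finished"));
    Path chunk = Files.createDirectories(dir.resolve("part-r-00000"));
    Files.createFile(chunk.resolve("data"));

    // The stale "failed" marker is mistaken for a chunk directory.
    System.out.println(firstMissingDataFile(dir));
  }
}
```

With a stale failed marker present, the scan ends up looking for a data file underneath the marker, which matches the path reported in the exception.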

I think the simple solution would be to whack the recovery directory if it 
exists before running the LogSorter.
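
A rough sketch of that idea, again on a local filesystem with java.nio.file rather than HDFS. The class and method names are hypothetical and the "sort" step is a placeholder that only writes one chunk and a finished marker; on HDFS the cleanup would presumably be a recursive delete such as Hadoop's FileSystem.delete(path, true).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper; NOT the Accumulo LogSorter.
final class RecoveryDirCleaner {

  // Recursively deletes recoveryDir if it exists.
  // Returns true if something was removed.
  static boolean deleteIfExists(Path recoveryDir) throws IOException {
    if (!Files.exists(recoveryDir)) {
      return false;
    }
    try (Stream<Path> walk = Files.walk(recoveryDir)) {
      // Reverse order visits children before their parent directories.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
    return true;
  }

  // Placeholder for a sort attempt: clear leftovers from any earlier
  // attempt first, then write fresh output and the finished marker.
  static void sortWithCleanSlate(Path recoveryDir) throws IOException {
    deleteIfExists(recoveryDir);
    Path chunk = Files.createDirectories(recoveryDir.resolve("part-r-00000"));
    Files.createFile(chunk.resolve("data"));
    Files.createFile(recoveryDir.resolve("finished"));
  }
}
```

Deleting up front means a later successful attempt cannot inherit a failed marker (or partial chunks) from an earlier one. It does not by itself address two servers sorting the same WAL concurrently.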

Obligatory: I don't know if the branches in Apache are identical to the fork
I'm looking at. Identifying all relevant branches is a necessary step here.



