[
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christopher Tubbs updated ACCUMULO-4851:
----------------------------------------
Fix Version/s: (was: 1.9.0)
> WAL recovery directory should be deleted before running LogSorter
> -----------------------------------------------------------------
>
> Key: ACCUMULO-4851
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Critical
>
> Noticed this one on a user's 1.7-ish system.
> A number of tablets (~9) were unassigned and reported on the Monitor as
> having failed to load. Digging into the exception, we could see the tablet
> load failed due to a FileNotFoundException:
> {noformat}
> 2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
> java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
> at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
> at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
> at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
> at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
> at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
> ... 9 more
> Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
> at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
> at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
> at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
> at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
> at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
> at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
> at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
> ... 11 more
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
> {noformat}
> Upon further investigation of the recovery directory in HDFS for this WAL, we
> find the following:
> {noformat}
> $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
> -rwxr--r-- 3 accumulo hdfs 0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
> -rwxr--r-- 3 accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
> -rw-r--r-- 3 accumulo hdfs 8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
> -rw-r--r-- 3 accumulo hdfs 642 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
> -rw-r--r-- 3 accumulo hdfs 8540196 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
> -rw-r--r-- 3 accumulo hdfs 524 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
> -rw-r--r-- 3 accumulo hdfs 8150879 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
> -rw-r--r-- 3 accumulo hdfs 584 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
> -rw-r--r-- 3 accumulo hdfs 8438021 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
> -rw-r--r-- 3 accumulo hdfs 630 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
> -rw-r--r-- 3 accumulo hdfs 4956770 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
> -rw-r--r-- 3 accumulo hdfs 408 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
> {noformat}
> The strange thing here is that we have both finished and failed markers for
> this WAL's recovery directory. Given the timestamps, it appears that TServer1
> tried to do recovery, failed for some reason, and then TServer2 came along
> and successfully completed the LogSort.
> However, when the merged read of the sorted files came along, it treated the
> failed marker as a sorted chunk (hence the attempt to open failed/data in the
> stack trace above), and the tablet load failed as a result.
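The filtering that the merged read would need can be sketched roughly as follows. This is only an illustration under stated assumptions, not Accumulo's MultiReader code: it uses java.nio.file on a local directory as a stand-in for the HDFS recovery directory, and the class SortedChunkLister and the method sortedChunks are hypothetical names. The marker names finished and failed are taken from the listing above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Accumulo.
class SortedChunkLister {
    // Marker file names as they appear in the recovery directory listing.
    static final String FINISHED = "finished";
    static final String FAILED = "failed";

    /** Returns the names of the sorted chunks under recoveryDir, skipping marker files. */
    static List<String> sortedChunks(Path recoveryDir) throws IOException {
        List<String> chunks = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(recoveryDir)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                // Neither marker is a sorted chunk, so neither may be opened as one.
                if (name.equals(FINISHED) || name.equals(FAILED)) {
                    continue;
                }
                // Sorted chunks are directories (each holds a data and an index file).
                if (Files.isDirectory(child)) {
                    chunks.add(name);
                }
            }
        }
        Collections.sort(chunks);
        return chunks;
    }
}
```

Skipping the markers only hides the symptom; it does not decide whether a directory carrying both markers should be trusted at all.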
> I think the simple solution would be to whack the recovery directory if it
> exists before running the LogSorter.
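A minimal sketch of that idea, assuming a local filesystem via java.nio.file in place of HDFS. The names RecoveryDirCleaner, deleteIfPresent and sortWithCleanSlate are hypothetical, and the Runnable merely stands in for whatever the LogSorter actually does:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: not part of Accumulo.
class RecoveryDirCleaner {
    /** Recursively deletes destDir if it exists. Returns true if anything was removed. */
    static boolean deleteIfPresent(Path destDir) throws IOException {
        if (!Files.exists(destDir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(destDir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    /** Clears any leftovers from an earlier attempt, then runs the sort. */
    static void sortWithCleanSlate(Path destDir, Runnable sorter) throws IOException {
        deleteIfPresent(destDir); // drops stale chunks and stale failed/finished markers
        Files.createDirectories(destDir);
        sorter.run();
    }
}
```

Starting from an empty directory means a failed marker left by an earlier attempt cannot sit next to the finished marker of a later, successful one.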
> Obligatory: I don't know if branches in Apache are verbatim to the fork I'm
> looking at. Identifying all relevant branches is a necessary step here.