[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432540#comment-16432540 ]
Christopher Tubbs commented on ACCUMULO-4851:
---------------------------------------------

I was interested in making a release candidate for 1.9.0 this week. Do you think this should be a blocker or should I proceed with a release candidate?

> WAL recovery directory should be deleted before running LogSorter
> -----------------------------------------------------------------
>
>                 Key: ACCUMULO-4851
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> Noticed this one on a user's 1.7-ish system.
> A number of tablets (~9) were unassigned and reported on the Monitor as
> having failed to load. Digging into the exception, we could see the tablet
> load failed due to a FileNotFoundException:
> {noformat}
> 2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
> java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
>     at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>     at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
>     at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
>     ... 9 more
> Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
>     at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
>     at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
>     at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
>     at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
>     at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
>     ... 11 more
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
> {noformat}
> Upon further investigation of the recovery directory in HDFS for this WAL, we
> find the following:
> {noformat}
> $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
> -rwxr--r-- 3 accumulo hdfs 0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
> -rwxr--r-- 3 accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
> -rw-r--r-- 3 accumulo hdfs 8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
> -rw-r--r-- 3 accumulo hdfs 642 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
> -rw-r--r-- 3 accumulo hdfs 8540196 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
> -rw-r--r-- 3 accumulo hdfs 524 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
> -rw-r--r-- 3 accumulo hdfs 8150879 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
> -rw-r--r-- 3 accumulo hdfs 584 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
> -rw-r--r-- 3 accumulo hdfs 8438021 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
> -rw-r--r-- 3 accumulo hdfs 630 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
> -rw-r--r-- 3 accumulo hdfs 4956770 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
> -rw-r--r-- 3 accumulo hdfs 408 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
> {noformat}
> The strange thing here is that we have both finished and failed markers for
> this WAL's recovery directory. Given the timestamps, it appears that TServer1
> tried to do recovery, failed for some reason, and then TServer2 came along
> and successfully completed the LogSort.
> However, when the merged-read of the sorted files came along, it treated the
> failed flag as a sorted-chunk, and failed as such.
> I think the simple solution would be to whack the recovery directory if it
> exists before running the LogSorter.
> Obligatory: I don't know if branches in Apache are verbatim to the fork I'm
> looking at. Identifying all relevant branches is a necessary step here.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
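
The fix proposed in the issue (remove the recovery directory, if it exists, before the LogSorter runs) could look roughly like the sketch below. This is not Accumulo code: the class and method names are hypothetical, and it uses the local filesystem through java.nio.file as a stand-in for HDFS. It only illustrates that a stale failed or finished marker from an earlier attempt is gone before a new sort writes its output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; Accumulo itself works against HDFS.
class RecoveryDirSketch {

  // Deletes the recovery directory for one WAL, including any stale
  // "failed"/"finished" markers and part-r-* chunks left by an earlier attempt.
  // Does nothing if the directory does not exist.
  static void clearRecoveryDir(Path recoveryDir) throws IOException {
    if (!Files.exists(recoveryDir)) {
      return;
    }
    List<Path> entries;
    try (Stream<Path> walk = Files.walk(recoveryDir)) {
      // Reverse order puts children ahead of their parent directories,
      // so every directory is empty by the time it is deleted.
      entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path entry : entries) {
      Files.delete(entry);
    }
  }
}
```

A caller would invoke clearRecoveryDir on the WAL's recovery path immediately before starting the sort, so that the later merged read only sees output from the current attempt.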