[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444394#comment-16444394
 ] 

Keith Turner commented on ACCUMULO-4851:
----------------------------------------

This is not a duplicate of #432

> WAL recovery directory should be deleted before running LogSorter
> -----------------------------------------------------------------
>
>                 Key: ACCUMULO-4851
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>
> Noticed this one on a user's 1.7-ish system.
> A number of tablets (~9) were unassigned and reported on the Monitor as 
> having failed to load. Digging into the exception, we could see the tablet 
> load failed due to a FileNotFoundException:
> {noformat}
> 2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to 
> assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
> java.lang.RuntimeException: java.io.IOException: 
> java.io.FileNotFoundException: File does not exist: 
> /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
>         at 
> org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at 
> org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>         at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not 
> exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>         at 
> org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
>         at 
> org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
>         ... 9 more
> Caused by: java.io.FileNotFoundException: File does not exist: 
> /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
>         at 
> org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
>         at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
>         at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
>         at 
> org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
>         at 
> org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
>         at 
> org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
>         ... 11 more
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: 
> java.io.FileNotFoundException: File does not exist: 
> /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet 
> xk;... reporting failure to master
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet 
> load in 600.00 seconds
> {noformat}
> Upon further investigation of the recovery directory in HDFS for this WAL, we 
> find the following:
> {noformat}
> $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
> -rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:12 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
> -rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:09 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
> -rw-r--r--   3 accumulo hdfs    8040761 2018-04-06 22:09 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
> -rw-r--r--   3 accumulo hdfs        642 2018-04-06 22:09 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
> -rw-r--r--   3 accumulo hdfs    8540196 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
> -rw-r--r--   3 accumulo hdfs        524 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
> -rw-r--r--   3 accumulo hdfs    8150879 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
> -rw-r--r--   3 accumulo hdfs        584 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
> -rw-r--r--   3 accumulo hdfs    8438021 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
> -rw-r--r--   3 accumulo hdfs        630 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
> -rw-r--r--   3 accumulo hdfs    4956770 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
> -rw-r--r--   3 accumulo hdfs        408 2018-04-06 22:10 
> accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
> {noformat}
> The strange thing here is that we have both finished and failed markers for 
> this WAL's recovery directory. Given the timestamps, it appears that TServer1 
> tried to do recovery, failed for some reason, and then TServer2 came along 
> and successfully completed the LogSort.
> However, when the merged-read of the sorted files came along, it treated the 
> failed flag as a sorted-chunk, and failed as such.
> I think the simple solution would be to whack the recovery directory if it 
> exists before running the LogSorter.
> Obligatory: I don't know if branches in Apache are verbatim to the fork I'm 
> looking at. Identifying all relevant branches is a necessary step here.
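
A minimal sketch of the "whack the recovery directory if it exists" step proposed above. This is only an illustration under stated assumptions: the class and method names (RecoveryDirCleaner, deleteIfExists) are hypothetical and not taken from the Accumulo source, and it uses java.nio.file on a local filesystem as a stand-in for the HDFS recovery directory, where the real code would presumably go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Accumulo: clears a stale recovery
// directory so a new sort does not inherit old marker files.
final class RecoveryDirCleaner {

    private RecoveryDirCleaner() {}

    // Deletes recoveryDir and everything under it (leftover "failed" or
    // "finished" markers, part-r-* outputs). Returns true if the directory
    // existed and was removed, false if there was nothing to delete.
    static boolean deleteIfExists(Path recoveryDir) throws IOException {
        if (!Files.exists(recoveryDir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(recoveryDir)) {
            // Reverse order lists children before their parent directory,
            // so each directory is empty by the time it is deleted.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }
}
```

Calling something like this immediately before the sort starts would mean a later successful attempt never sees a failed marker left by an earlier one, which is the situation in the listing above.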



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
