Keith Turner created ACCUMULO-2406:
--------------------------------------
Summary: GarbageCollectorIT.dontGCRootLog() fails
Key: ACCUMULO-2406
URL: https://issues.apache.org/jira/browse/ACCUMULO-2406
Project: Accumulo
Issue Type: Bug
Environment: 2ef2d88598f5e14f8f96b77fecca66dcd7196448
Reporter: Keith Turner
Assignee: Keith Turner
Fix For: 1.6.0
Saw the following failure while running ITs. The test is not using MiniDFS,
which is what caused the problem. Because minidfs was not used, recent walog
updates were lost. Below are my notes from investigating. To fix this I am
going to look into using RawLocalFS like I did w/ VolumeIT.
{noformat}
dontGCRootLog(org.apache.accumulo.test.functional.GarbageCollectorIT) Time
elapsed: 108.685 sec <<< ERROR!
java.lang.RuntimeException:
org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server
rd6ul-14723v.infosec.tycho.ncsc.mil:34130
at
org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:187)
at
org.apache.accumulo.test.functional.GarbageCollectorIT.dontGCRootLog(GarbageCollectorIT.java:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error
on server rd6ul-14723v.infosec.tycho.ncsc.mil:34130
at
org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:285)
at
org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:84)
at
org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:177)
... 10 more
Caused by: org.apache.thrift.TApplicationException: Internal error processing
startScan
at
org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startScan(TabletClientService.java:226)
at
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startScan(TabletClientService.java:202)
at
org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:399)
at
org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:277)
... 12 more
{noformat}
Looking in the tserver logs, I found this error was caused by a missing file.
{noformat}
TabletServer_1756174045.out:java.io.IOException: Failed to open
V/accumulo/tables/!0/table_info/A0000004.rf
{noformat}
The file was garbage collected
{noformat}
SimpleGarbageCollector_1083069032.out:2014-02-25 12:48:12,897
[gc.SimpleGarbageCollector] DEBUG: Deleting
V/accumulo/tables/!0/table_info/A0000004.rf
{noformat}
The file was created on another tablet server and compacted away. That tablet
server was killed by the test code right after doing the compaction. The
tserver is killed shortly after compaction 4 into 5.
{noformat}
TabletServer_1242447446.out:2014-02-25 12:48:07,256 [tserver.Tablet] DEBUG:
Starting MajC !0;~< (USER) [V/accumulo/tables/!0/table_info/A0000003.rf] -->
V/accumulo/tables/!0/table_info/A0000004.rf_tmp []
TabletServer_1242447446.out:2014-02-25 12:48:07,290 [tserver.Tablet]
TABLET_HIST: !0;~< MajC [V/accumulo/tables/!0/table_info/A0000003.rf] -->
V/accumulo/tables/!0/table_info/A0000004.rf
TabletServer_1242447446.out:2014-02-25 12:48:10,007 [tserver.Tablet] DEBUG:
Starting MajC !0;~< (USER) [V/accumulo/tables/!0/table_info/A0000004.rf] -->
V/accumulo/tables/!0/table_info/A0000005.rf_tmp []
TabletServer_1242447446.out:2014-02-25 12:48:10,044 [tserver.Tablet]
TABLET_HIST: !0;~< MajC [V/accumulo/tables/!0/table_info/A0000004.rf] -->
V/accumulo/tables/!0/table_info/A0000005.rf
{noformat}
So when the root tablet recovered it brought back old data. Looking in the root
tablet write ahead log, it only contained the following after the crash. This
is why A0000004.rf was brought back and A0000005.rf was forgotten.
{noformat}
COMPACTION_START 1 8
file:/local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.functional.GarbageCollectorIT_dontGCRootLog/accumulo/tables/+r/root_tablet/F000004m.rf
COMPACTION_FINISH 1 9
DEFINE_TABLET 1 9 +r<<
MANY_MUTATIONS 1 9
1 mutations:
~delfile:/local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.functional.GarbageCollectorIT_dontGCRootLog/accumulo/tables/!0/table_info/A0000002.rf
: [system]:23 [] <deleted>
MANY_MUTATIONS 1 9
1 mutations:
!0<
srv:flush [system]:24 [] 4
srv:lock [system]:24 []
tservers/host1:44035/zlock-0000000000$1446a28bfef0003
MANY_MUTATIONS 1 9
1 mutations:
!0;~
srv:flush [system]:25 [] 4
srv:lock [system]:25 []
tservers/host1:57501/zlock-0000000000$1446a28bfef0001
MANY_MUTATIONS 1 9
1 mutations:
!0<
srv:compact [system]:26 [] 4
srv:lock [system]:26 []
tservers/host1:44035/zlock-0000000000$1446a28bfef0003
MANY_MUTATIONS 1 9
1 mutations:
~delfile:/local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.functional.GarbageCollectorIT_dontGCRootLog/accumulo/tables/!0/table_info/A0000003.rf
: [system]:27 []
MANY_MUTATIONS 1 9
1 mutations:
!0;~
file:file:/local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.functional.GarbageCollectorIT_dontGCRootLog/accumulo/tables/!0/table_info/A0000003.rf
[system]:28 [] <deleted>
file:file:/local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.functional.GarbageCollectorIT_dontGCRootLog/accumulo/tables/!0/table_info/A0000004.rf
[system]:28 [] 578,5
srv:compact [system]:28 [] 4
last:1446a28bfef0001 [system]:28 [] host1:57501
srv:lock [system]:28 []
tservers/host1:57501/zlock-0000000000$1446a28bfef0001
{noformat}
below is some info about the root tablet recovery from the tserver logs
{noformat}
2014-02-25 12:49:26,353 [tserver.Tablet] INFO : Starting Write-Ahead Log
recovery for +r<<
2014-02-25 12:49:26,356 [tserver.TabletServer] INFO : Looking for
V/accumulo/recovery/770c5d8a-c598-4e3c-8b22-9ce27eee5f40/finished
2014-02-25 12:49:26,360 [log.SortedLogRecovery] INFO : Looking at mutations
from V/accumulo/recovery/770c5d8a-c598-4e3c-8b22-9ce27eee5f40 for +r<<
2014-02-25 12:49:26,376 [log.SortedLogRecovery] DEBUG: Found tid, seq 1 1
2014-02-25 12:49:26,378 [log.SortedLogRecovery] DEBUG: minor compaction into
V/accumulo/tables/+r/root_tablet/F000004g.rf finished, but was still in the
METADATA
2014-02-25 12:49:26,378 [log.SortedLogRecovery] DEBUG: minor compaction into
V/accumulo/tables/+r/root_tablet/F000004i.rf finished, but was still in the
METADATA
2014-02-25 12:49:26,378 [log.SortedLogRecovery] DEBUG: minor compaction into
V/accumulo/tables/+r/root_tablet/F000004k.rf finished, but was still in the
METADATA
2014-02-25 12:49:26,378 [log.SortedLogRecovery] DEBUG: minor compaction into
V/accumulo/tables/+r/root_tablet/F000004m.rf finished, but was still in the
METADATA
2014-02-25 12:49:26,382 [log.SortedLogRecovery] INFO : Scanning for mutations
starting at sequence number 8 for tid 1
2014-02-25 12:49:26,386 [log.SortedLogRecovery] INFO : Recovery complete for
+r<< using V/accumulo/recovery/770c5d8a-c598-4e3c-8b22-9ce27eee5f40
2014-02-25 12:49:26,387 [tserver.Tablet] INFO : Write-Ahead Log recovery
complete for +r<< (6 mutations applied, 13 entries created)
2014-02-25 12:49:26,396 [tserver.Tablet] TABLET_HIST: +r<< opened
{noformat}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)