Josh Elser created ACCUMULO-2963:
------------------------------------
Summary: ReplicationDriver daemon dies from RTE thrown out of
BatchScanner
Key: ACCUMULO-2963
URL: https://issues.apache.org/jira/browse/ACCUMULO-2963
Project: Accumulo
Issue Type: Bug
Components: replication
Reporter: Josh Elser
Assignee: Josh Elser
Fix For: 1.7.0
Saw failure on build server where replication didn't happen in an integration
test. A tablet server was restarted as a part of this test.
As the tabletserver was starting back up, the Master was trying to scan the
ReplicationTable. Before the tserver came up "completely" (not sure on
details), the Master started getting repeated RuntimeExceptions:
{noformat}
Exception in thread "Replication Driver" java.lang.RuntimeException: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.hasNext(TabletServerBatchReaderIterator.java:182)
    at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.removeCompleteRecords(RemoveCompleteReplicationRecords.java:124)
    at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.run(RemoveCompleteReplicationRecords.java:88)
    at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:94)
Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:690)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
    at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
    at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
    at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
    at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:660)
    at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
    at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:302)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$1400(TabletServerBatchReaderIterator.java:76)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:386)
    at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Thread.java:745)
Caused by: ThriftSecurityException(user:!SYSTEM, code:null)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
    ... 17 more
{noformat}
The TabletServer was still in the process of starting, but must have already
obtained its lock (otherwise we couldn't have talked to it). It appears that
the exceptions started printing repeatedly in the Master log before the
tserver hit its main loop (lines 2414-2471 at f4024930).
I think there may be a separate issue with the client receiving those
Exceptions before a tserver is "fully" up, but the Master thread needs to be
resilient against these exceptions bubbling up.
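
To illustrate the kind of resilience meant here, a minimal sketch follows. It is
not the actual ReplicationDriver code or the eventual fix: the class name
ResilientDriver, the "work" Runnable (standing in for one pass such as
RemoveCompleteReplicationRecords.run()), and the iteration and sleep parameters
are all hypothetical. The idea is only that a RuntimeException from one pass is
caught and logged so the daemon thread survives to try again.

```java
// Hypothetical sketch: a daemon-style loop that survives RuntimeExceptions
// thrown by a unit of work. None of these names come from Accumulo itself.
final class ResilientDriver implements Runnable {
  private final Runnable work;      // stand-in for one replication pass
  private final int maxIterations;  // bounded here so the sketch terminates
  private final long sleepMillis;   // pause between passes
  private int failures = 0;

  ResilientDriver(Runnable work, int maxIterations, long sleepMillis) {
    this.work = work;
    this.maxIterations = maxIterations;
    this.sleepMillis = sleepMillis;
  }

  @Override
  public void run() {
    for (int i = 0; i < maxIterations; i++) {
      try {
        work.run();
      } catch (RuntimeException e) {
        // Log and keep going instead of letting the thread die.
        failures++;
        System.err.println("Pass failed, will retry on next iteration: " + e);
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        // Preserve the interrupt status and stop the loop.
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  int getFailures() {
    return failures;
  }
}
```

A real daemon would presumably loop until shutdown rather than for a fixed
number of iterations, and would log through the project's logger instead of
System.err.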