[ 
https://issues.apache.org/jira/browse/ACCUMULO-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved ACCUMULO-2963.
----------------------------------

    Resolution: Fixed

> ReplicationDriver daemon dies from RTE thrown out of BatchScanner
> -----------------------------------------------------------------
>
>                 Key: ACCUMULO-2963
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2963
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Saw failure on build server where replication didn't happen in an integration 
> test. A tablet server was restarted as a part of this test.
> As the tabletserver was starting back up, the Master was trying to scan the 
> ReplicationTable. Before the tserver came up "completely" (not sure on 
> details), the Master started getting repeated RuntimeExceptions:
> {noformat}
> Exception in thread "Replication Driver" java.lang.RuntimeException: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.hasNext(TabletServerBatchReaderIterator.java:182)
>         at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.removeCompleteRecords(RemoveCompleteReplicationRecords.java:124)
>         at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.run(RemoveCompleteReplicationRecords.java:88)
>         at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:94)
> Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:690)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
>         at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:660)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
>         at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:302)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$1400(TabletServerBatchReaderIterator.java:76)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:386)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: ThriftSecurityException(user:!SYSTEM, code:null)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
>         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
>         ... 17 more
> {noformat}
> The TabletServer was still in the process of starting, but must have already 
> obtained its lock (otherwise we couldn't have talked to it). It appears that 
> the exceptions started printing repeatedly in the Master log before the 
> tserver hit its main loop (lines 2414-2471 at f4024930).
> I think there may be a separate issue with the client receiving those 
> Exceptions before a tserver is "fully" up, but the Master thread needs to be 
> resilient against these exceptions bubbling up.
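
To illustrate the kind of resilience the description asks for, here is a minimal, self-contained sketch of a driver loop that catches a RuntimeException thrown by one step, logs it, and retries on the next cycle instead of letting the thread die. It is not the change committed for this issue: the class name ResilientDriver, the method runCycles, and the parameters maxCycles and sleepMillis are all hypothetical, and the simulated failure only stands in for the scanner error seen above.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ResilientDriver {

    /**
     * Runs every step once per cycle for up to maxCycles cycles.
     * A RuntimeException from a step is logged and counted, and the loop
     * moves on rather than terminating the calling thread.
     * Returns the number of step invocations that failed.
     * (All names here are hypothetical, not Accumulo API.)
     */
    static int runCycles(Runnable[] steps, int maxCycles, long sleepMillis) {
        int failures = 0;
        for (int cycle = 0; cycle < maxCycles; cycle++) {
            for (Runnable step : steps) {
                try {
                    step.run();
                } catch (RuntimeException e) {
                    // Assumed transient (e.g. a server still starting up):
                    // record it and try again on the next cycle.
                    failures++;
                    System.err.println("Step failed, will retry next cycle: " + e);
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and stop looping.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated step that fails on its first two invocations only.
        Runnable flaky = () -> {
            if (calls.incrementAndGet() <= 2) {
                throw new RuntimeException("simulated scan failure");
            }
        };
        int failures = runCycles(new Runnable[] {flaky}, 5, 10L);
        System.out.println("calls=" + calls.get() + " failures=" + failures);
    }
}
```

Catching RuntimeException broadly at the loop boundary is a deliberate trade-off in this sketch: it keeps the daemon alive through failures that are expected to clear on their own, at the cost of also retrying errors that will never succeed, so a real implementation would likely want to log at a level that makes persistent failures visible.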



--
This message was sent by Atlassian JIRA
(v6.2#6252)
