[
https://issues.apache.org/jira/browse/ACCUMULO-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828132#comment-13828132
]
Luke Brassard commented on ACCUMULO-1830:
-----------------------------------------
We saw this behavior in a 1.5.x build and want to share how we think it
occurred, for others who may run into similar issues.
Here's a breakdown of the events that took place:
{noformat:nopanel=true}
Timeline:
20:15 i.rf created // A
20:20 i.rf compacted away to j.rf
20:25 i.rf deleted and references updated // B
??:?? Minor compaction cleans up old walogs // C
??:?? CLUSTER CRASH
22:20 recovery
22:22 missing file (i.rf) reported
In pictures:
        WAL-1                       RootTablet
       +---------------+           +---------------+
   ?   | file:i.rf     |<---[A]--->| file:i.rf     |
  /    |               |        +->| del:file:i.rf |
 /     +---------------+       /   +---------------+
[C]                           /
 \      WAL-2              [B]
  \    +---------------+    /
   \+-X| del:file:i.rf | <-+
       |               |
       +---------------+
A: 'i' reference written to RootTablet and WAL-1
B: 'i' delete marker written to RootTablet and WAL-2
- at this point, there is a delete marker in the RootTablet and in a WAL
C: After compaction, Accumulo cleans up WAL-1 and WAL-2 but is
interrupted by failure and WAL-1 is left behind
{noformat}
When restarted, the 'file:i.rf' record is recovered from the leftover WAL-1
and added back to the RootTablet. That tells Accumulo the file exists, even
though it was deleted before the crash happened.
A lack of atomicity in the cleanup of walogs seems to be the cause of this
behavior.
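To make the failure mode concrete, here is a toy model of the replay step. It is only a sketch and not Accumulo's recovery code: the class name WalReplaySketch, the "put:"/"del:" entry encoding, and the assumption that the persisted RootTablet state holds only j.rf after the compaction are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class WalReplaySketch {

    /**
     * Replays the surviving logs, oldest first, on top of the persisted
     * file references and returns the resulting set of referenced files.
     * Entries are hypothetical strings: "put:<file>" or "del:<file>".
     */
    public static Set<String> recover(Set<String> persisted, List<List<String>> survivingLogs) {
        Set<String> files = new TreeSet<>(persisted);
        for (List<String> log : survivingLogs) {
            for (String entry : log) {
                if (entry.startsWith("put:")) {
                    files.add(entry.substring(4));
                } else if (entry.startsWith("del:")) {
                    files.remove(entry.substring(4));
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> wal1 = Arrays.asList("put:i.rf"); // event A
        List<String> wal2 = Arrays.asList("del:i.rf"); // event B
        // Assumed persisted state after compaction: only j.rf is referenced.
        Set<String> persisted = new TreeSet<>(Arrays.asList("j.rf"));

        // Both logs survive: the delete cancels the stale reference.
        System.out.println(recover(persisted, Arrays.asList(wal1, wal2))); // prints [j.rf]

        // Cleanup interrupted, only WAL-1 is left: i.rf comes back.
        System.out.println(recover(persisted, Arrays.asList(wal1)));       // prints [i.rf, j.rf]
    }
}
```

In this model the second call reproduces the symptom from the timeline: a reference to i.rf reappears although the file is gone. Removing the logs as one unit, or always removing the oldest log first, would avoid it in the sketch; whether that matches the actual fix is not something this example claims.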
> illegal state in RestartStressIT
> --------------------------------
>
> Key: ACCUMULO-1830
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1830
> Project: Accumulo
> Issue Type: Bug
> Components: master, tserver
> Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.4.4, 1.5.0
> Environment: on master, 135e67b68592f0d1c7ca69bac318a7ad3ed55831
> Reporter: Eric Newton
> Assignee: Eric Newton
> Priority: Critical
> Fix For: 1.4.5, 1.5.1
>
>
> {noformat}
> 2013-10-29 15:20:11,125 [state.MetaDataTableScanner] ERROR: java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent 1<: host:50867[14205a7c2a90003] and host:41255[14205a7c2a9000a]
> java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent 1<: host[14205a7c2a90003] and host:41255[14205a7c2a9000a]
>     at org.apache.accumulo.server.master.state.MetaDataTableScanner.fetch(MetaDataTableScanner.java:189)
>     at org.apache.accumulo.server.master.state.MetaDataTableScanner.next(MetaDataTableScanner.java:124)
>     at org.apache.accumulo.server.master.state.MetaDataTableScanner.next(MetaDataTableScanner.java:1)
>     at org.apache.accumulo.server.master.TabletGroupWatcher.run(TabletGroupWatcher.java:143)
> Caused by: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent 1<: host:50867[14205a7c2a90003] and host:41255[14205a7c2a9000a]
>     at org.apache.accumulo.server.master.state.MetaDataTableScanner.createTabletLocationState(MetaDataTableScanner.java:157)
>     at org.apache.accumulo.server.master.state.MetaDataTableScanner.fetch(MetaDataTableScanner.java:185)
>     ... 3 more
> {noformat}
> Here's where the test stopped:
> {noformat}
> java.lang.IllegalStateException: Tablet has multiple locations : 1<
>     at org.apache.accumulo.core.metadata.MetadataLocationObtainer.getMetadataLocationEntries(MetadataLocationObtainer.java:233)
>     at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablet(MetadataLocationObtainer.java:118)
>     at org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:462)
>     at org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:619)
>     at org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:437)
>     at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:226)
>     at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:84)
>     at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:177)
>     at org.apache.accumulo.test.VerifyIngest.verifyIngest(VerifyIngest.java:162)
>     at org.apache.accumulo.test.functional.RestartStressIT.test(RestartStressIT.java:73)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}