[ https://issues.apache.org/jira/browse/ACCUMULO-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834100#comment-13834100 ]

Eric Newton edited comment on ACCUMULO-1940 at 11/29/13 3:02 PM:
-----------------------------------------------------------------

I haven't been able to figure out how the DatafileManager got into an 
inconsistent state. bringMinorCompactionOnline(...) appears to have completed 
successfully, which means that we should have correctly placed an entry for this 
flush file into the Map. Unless I missed it, there were also no other compactions 
on this tablet that could have altered the collection of files.


was (Author: elserj):
I haven't been able to figure out how the DatafileManager got into a consistent 
state. bringMinorCompactionOnline(...) appears to have completed successfully 
which means that we should have correctly placed an entry for this flush file 
into the Map. Unless I missed it, there were no other compactions with this 
tablet either that could have altered the collection of files.

> Data file in !METADATA differs from in memory data
> --------------------------------------------------
>
>                 Key: ACCUMULO-1940
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1940
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.5.0
>            Reporter: Josh Elser
>
> Found during CI run with agitation.
> Got the first two error messages 5 times (presumably inside a 
> retry-on-failure block):
> {noformat}
> Failed to do close consistency check for tablet c;79d0ab;7870a
>       java.lang.RuntimeException: Data file in !METADATA differ from in 
> memory data c;79d0ab;7870a  {/t-0005h1j/A0005n8k.rf=797350457 19198312, 
> /t-0005h1j/C0005skm.rf=798078368 19322025, /t-0005h1j/C0005tet.rf=89783168 
> 2196349, /t-0005h1j/C0005u20.rf=90979448 2227972, 
> /t-0005h1j/F0005u0v.rf=23410023 582233, /t-0005h1j/F0005u2p.rf=21958551 
> 547159, /t-0005h1j/F0005u3g.rf=14395121 358893}  
> {/t-0005h1j/A0005n8k.rf=797350457 19198312, /t-0005h1j/C0005skm.rf=798078368 
> 19322025, /t-0005h1j/C0005tet.rf=89783168 2196349, 
> /t-0005h1j/C0005u20.rf=90979448 2227972, /t-0005h1j/F0005u2p.rf=21958551 
> 547159, /t-0005h1j/F0005u3g.rf=14395121 358893}
>               at 
> org.apache.accumulo.server.tabletserver.Tablet.closeConsistencyCheck(Tablet.java:2847)
>               at 
> org.apache.accumulo.server.tabletserver.Tablet.completeClose(Tablet.java:2780)
>               at 
> org.apache.accumulo.server.tabletserver.Tablet.close(Tablet.java:2658)
>               at 
> org.apache.accumulo.server.tabletserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2357)
>               at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>               at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>               at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>               at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>               at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>               at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>               at java.lang.Thread.run(Thread.java:744)
> {noformat}
> Then we logged that the consistency check had failed:
> {noformat}
> Consistency check fails, retrying java.lang.RuntimeException: Failed to do 
> close consistency check for tablet c;79d0ab;7870a
> {noformat}
> In the end, we gave up and closed it anyway:
> {noformat}
> Tablet closed consistency check has failed for c;79d0ab;7870a giving up and 
> closing
> {noformat}
> Before all of this happened, we tried to bring this tablet online after a 
> failure on a new tserver. During the minor compaction (minc) performed as 
> part of the recovery process, we failed to get the lease on the .rf_tmp file 
> we tried to create. This failed a couple of times, but we eventually got the 
> tmp file we needed, the recovery process completed, and we could bring the 
> tablet online. The difference between the in-memory version and the 
> !METADATA version was the one flushed rfile that we created during this 
> recovery process.
> The problem eventually fixed itself because the tablet was migrated to a 
> different server and we just took what was (correctly) in the !METADATA table.
> There is still an unknown issue of how we missed the flush RFile in the 
> DatafileManager's copy.
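
The two maps in the exception message are long enough that the single differing entry is easy to overlook. The following is a minimal sketch, not Accumulo code, of diffing two such file maps by key. The class name DatafileDiff and the method missingFrom are hypothetical, the values are kept as the raw strings from the log rather than parsed, and treating the first logged map as the !METADATA view is an inference from the description above (the in-memory copy is the one that lacked the flush file).

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for reading the logged mismatch; not part of Accumulo. */
public class DatafileDiff {

    /** Returns the file paths that are keys of {@code left} but not of {@code right}. */
    public static Set<String> missingFrom(Map<String, String> left, Map<String, String> right) {
        Set<String> missing = new TreeSet<>(left.keySet());
        missing.removeAll(right.keySet());
        return missing;
    }

    public static void main(String[] args) {
        // First map from the exception message (assumed to be the !METADATA view).
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("/t-0005h1j/A0005n8k.rf", "797350457 19198312");
        metadata.put("/t-0005h1j/C0005skm.rf", "798078368 19322025");
        metadata.put("/t-0005h1j/C0005tet.rf", "89783168 2196349");
        metadata.put("/t-0005h1j/C0005u20.rf", "90979448 2227972");
        metadata.put("/t-0005h1j/F0005u0v.rf", "23410023 582233");
        metadata.put("/t-0005h1j/F0005u2p.rf", "21958551 547159");
        metadata.put("/t-0005h1j/F0005u3g.rf", "14395121 358893");

        // Second map from the exception message (assumed to be the in-memory
        // view): identical entries except that one flush file is absent.
        Map<String, String> inMemory = new TreeMap<>(metadata);
        inMemory.remove("/t-0005h1j/F0005u0v.rf");

        System.out.println("Only in !METADATA: " + missingFrom(metadata, inMemory));
        System.out.println("Only in memory:    " + missingFrom(inMemory, metadata));
    }
}
```

Run against the logged values, the only path reported as missing from the second map is /t-0005h1j/F0005u0v.rf, and nothing is missing in the other direction, which matches the description of a single flushed rfile being absent from one copy.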



--
This message was sent by Atlassian JIRA
(v6.1#6144)
