[jira] [Created] (ACCUMULO-3234) Merge fails to complete

Josh Elser (JIRA) Tue, 14 Oct 2014 15:58:47 -0700

Josh Elser created ACCUMULO-3234:
------------------------------------

             Summary: Merge fails to complete
                 Key: ACCUMULO-3234
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3234
             Project: Accumulo
          Issue Type: Bug
          Components: master, tserver
    Affects Versions: 1.6.1
            Reporter: Josh Elser
            Assignee: Josh Elser
             Fix For: 1.6.2, 1.7.0



Running the Shard module of Randomwalk.

Noticed that some driver code around the test had timed out:

{noformat}
14 02:44:21,839 [shard.Merge] DEBUG: merging 
ST_index_hostname_24487_1413254628126
14 02:46:22,129 [impl.ThriftTransportPool] WARN : Thread 
"org.apache.accumulo.test.randomwalk.Framework" stuck on IO  to master:9999 (0) 
for at least 120050 ms
14 02:49:21,807 [randomwalk.Module] WARN : Node 
org.apache.accumulo.test.randomwalk.shard.Merge has been running for 300.003 
seconds. You may want to look into it.
{noformat}

A few seconds later, the master complains about the tabletserver failing to 
unload some tablets for the table we're merging (presumably to do the merge), 
and the TabletServer hits an NPE trying to run a MinC (line numbers screwed up, 
so omitting the stack traces). We start down trying to get a new filename

{code}
FileRef newMapfileLocation = getNextMapFilename(mergeFile == null ? "F" : "M");
{code}

Which ultimately threw an AccessControlException from Hadoop because it tried 
to write to the root of HDFS instead of the provided directory (issue #1).

This also cause an NPE in the finally block for the MinorCompactionTask on 
(issue #2)

{code}
minorCompaction.data("numEntries", Long.toString(this.stats.getNumEntries()));
{code}

Since all of this (+12hrs), the client is still blocked on the merge to 
complete, the tserver is stuck in a loop trying to unload the tablets

{noformat}
2014-10-14 22:53:48,761 [tserver.Tablet] DEBUG: initiateClose(saveState=true 
queueMinC=false disableWrites=false) 2w;00001a;000018
2014-10-14 22:53:48,761 [tserver.TabletServer] DEBUG: Failed to unload tablet 
2w;00001a;000018... it was alread closing or closed : Tablet 2w;00001a;000018 
already closing
{noformat}

the master is stuck in a loop waiting for those tablets to go offline (I think)

{noformat}
2014-10-14 22:54:40,680 [state.MergeStats] INFO : Computing next merge state 
for 2w<< which is presently WAITING_FOR_OFFLINE isDelete : false
2014-10-14 22:54:40,680 [state.MergeStats] INFO : 21 tablets are chopped, 0 are 
offline 2w<<
2014-10-14 22:54:40,680 [state.MergeStats] INFO : Waiting for 0 unassigned 
tablets to be 21 2w<<
2014-10-14 22:54:40,680 [master.Master] DEBUG: [Normal Tablets] sleeping for 
60.00 seconds
2014-10-14 22:54:40,684 [master.Master] DEBUG: Finished gathering information 
from 1 servers in 0.00 seconds
2014-10-14 22:54:40,685 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
with 0 migrations
2014-10-14 22:54:40,685 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
with 0 migrations
2014-10-14 22:54:40,685 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
with 0 migrations
2014-10-14 22:54:40,686 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
with 0 migrations
2014-10-14 22:54:40,686 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
with 0 migrations
2014-10-14 22:54:40,686 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
with 0 migrations
2014-10-14 22:54:40,804 [state.ZooTabletStateStore] DEBUG: Returning root 
tablet state: 
+r<<@(null,hostname:9997[1490be48ae20006],hostname:9997[1490be48ae20006])
2014-10-14 22:54:40,804 [master.Master] DEBUG: [Root Table]: scan time 0.00 
seconds
2014-10-14 22:54:40,804 [master.Master] DEBUG: [Root Table] sleeping for 60.00 
seconds
2014-10-14 22:54:40,830 [master.Master] DEBUG: [Metadata Tablets]: scan time 
0.04 seconds
2014-10-14 22:54:40,830 [master.Master] DEBUG: [Metadata Tablets] sleeping for 
60.00 seconds
2014-10-14 22:54:40,830 [master.Master] DEBUG: mergeInfo overlaps: 2w;000002< 
true
2014-10-14 22:54:40,831 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000004;000002 true
2014-10-14 22:54:40,831 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000006;000004 true
2014-10-14 22:54:40,832 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000008;000006 true
2014-10-14 22:54:40,832 [master.Master] DEBUG: mergeInfo overlaps: 
2w;00000a;000008 true
2014-10-14 22:54:40,833 [master.Master] DEBUG: mergeInfo overlaps: 
2w;00000c;00000a true
2014-10-14 22:54:40,833 [master.Master] DEBUG: mergeInfo overlaps: 
2w;00000e;00000c true
2014-10-14 22:54:40,834 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000010;00000e true
2014-10-14 22:54:40,834 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000012;000010 true
2014-10-14 22:54:40,835 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000014;000012 true
2014-10-14 22:54:40,835 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000016;000014 true
2014-10-14 22:54:40,835 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000018;000016 true
2014-10-14 22:54:40,836 [master.Master] DEBUG: mergeInfo overlaps: 
2w;00001a;000018 true
2014-10-14 22:54:40,836 [master.Master] DEBUG: mergeInfo overlaps: 
2w;00001c;00001a true
2014-10-14 22:54:40,836 [master.Master] DEBUG: mergeInfo overlaps: 
2w;00001e;00001c true
2014-10-14 22:54:40,837 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000020;00001e true
2014-10-14 22:54:40,837 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000022;000020 true
2014-10-14 22:54:40,837 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000024;000022 true
2014-10-14 22:54:40,838 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000026;000024 true
2014-10-14 22:54:40,838 [master.Master] DEBUG: mergeInfo overlaps: 
2w;000028;000026 true
2014-10-14 22:54:40,839 [master.Master] DEBUG: mergeInfo overlaps: 2w<;000028 
true
2014-10-14 22:54:40,839 [master.Master] DEBUG: [Normal Tablets]: scan time 0.04 
seconds
2014-10-14 22:54:40,839 [master.EventCoordinator] INFO : [Normal Tablets]: 21 
tablets unloaded
{noformat}

And the fate op is still locked (combined issue #3)

{noformat}
txid: 3a638f9642050e1a  status: IN_PROGRESS         op: TableRangeOp     
locked: [W:2w, R:+default] locking: []              top: TableRangeOpWait
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (ACCUMULO-3234) Merge fails to complete

Reply via email to