[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209485#comment-14209485
 ] 

Hudson commented on HBASE-12457:


FAILURE: Integrated in HBase-TRUNK #5772 (See 
[https://builds.apache.org/job/HBase-TRUNK/5772/])
HBASE-12457 Regions in transition for a long time when CLOSE interleaves with a 
slow compaction. (larsh: rev 231d3ee2adbfc32dfe4f7d7cd7a96ac33968520e)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java


 Regions in transition for a long time when CLOSE interleaves with a slow 
 compaction
 ---

 Key: HBASE-12457
 URL: https://issues.apache.org/jira/browse/HBASE-12457
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.7
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, 
 12457-combined-trunk.txt, 12457-minifix.txt, 12457.interrupt-v2.txt, 
 12457.interrupt.txt, HBASE-12457.patch


 Under heavy load we have observed regions remaining in transition for 20 
 minutes when the master requests a close while a slow compaction is running.
 The pattern is always something like this:
 # RS starts a compaction
 # HM requests the region to be closed on this RS
 # Compaction is not aborted for another 20 minutes
 # The region is in transition and not usable.
 In every case I tracked down so far the time between the requested CLOSE and 
 abort of the compaction is almost exactly 20 minutes, which is suspicious.
 Of course part of the issue is having compactions that take over 20 minutes, 
 but maybe we can do better here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209520#comment-14209520
 ] 

Hudson commented on HBASE-12457:


SUCCESS: Integrated in HBase-0.98 #674 (See 
[https://builds.apache.org/job/HBase-0.98/674/])
HBASE-12457 Regions in transition for a long time when CLOSE interleaves with a 
slow compaction. (larsh: rev 56af34831fc854c177697aefaf80d535996f87e8)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209531#comment-14209531
 ] 

Hudson commented on HBASE-12457:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #642 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/642/])
HBASE-12457 Regions in transition for a long time when CLOSE interleaves with a 
slow compaction. (larsh: rev 56af34831fc854c177697aefaf80d535996f87e8)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

2014-11-13 Thread Dima Spivak (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209660#comment-14209660
 ] 

Dima Spivak commented on HBASE-12457:
-

[~lhofhansl], this commit looks to be [breaking test-compile on 
branch-1|https://builds.apache.org/job/HBase-1.0/462/console] and is [causing 5 
tests from TestRegionReplicas to fail on 
master|https://builds.apache.org/job/HBase-TRUNK/5772/testReport/] :(. FWIW, I 
reran on my local build machines and got the same errors.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

2014-11-13 Thread ramkrishna.s.vasudevan (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209704#comment-14209704
 ] 

ramkrishna.s.vasudevan commented on HBASE-12457:


[~larsh]
{code}
writestate.wait(millis);
if (millis > 0 && EnvironmentEdgeManager.currentTime() - start >= millis) {
  // if we waited once for compactions to finish, interrupt them, and try again
  if (LOG.isDebugEnabled()) {
    LOG.debug("Waited for " + millis
      + " ms for compactions to finish on close. Interrupting "
      + currentCompactions.size() + " compactions.");
  }
  for (Thread t : currentCompactions.keySet()) {
    // interrupt any current IO in the currently running compactions.
    t.interrupt();
  }
  millis = 0;
}
{code}
In this code we interrupt all the threads and set millis = 0. The code then 
goes back around the outer loop and waits again in writestate.wait(0), 
expecting a notify to happen. But what if, by this time, all the threads have 
already been interrupted and notifyAll has already been called?
{code}
finally {
  if (wasStateSet) {
    synchronized (writestate) {
      --writestate.compacting;
      if (writestate.compacting <= 0) {
        writestate.notifyAll();
      }
    }
  }
}
{code}
Would we end up waiting forever?
I may be wrong here; please correct me.
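
For readers following along, the close path quoted above can be reduced to a small standalone sketch: wait a bounded time for the compaction, interrupt it if it is still running, then wait again without a timeout. This is only an illustration under assumptions, not the HBase code. `CloseWaitSketch`, `closeWhileCompacting` and the nested `WriteState` are hypothetical names, `Thread.sleep` stands in for compaction IO, and `System.currentTimeMillis()` replaces `EnvironmentEdgeManager.currentTime()`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CloseWaitSketch {
    // Hypothetical stand-in for the region's writestate monitor.
    static final class WriteState {
        int compacting = 0;
    }

    /**
     * Runs one simulated compaction lasting compactionMillis, then "closes":
     * waits up to waitMillis, interrupts the compaction if it is still running,
     * and waits again without a timeout.
     * Returns true if the compaction was cut short by the interrupt.
     */
    public static boolean closeWhileCompacting(long compactionMillis, long waitMillis) {
        final WriteState writestate = new WriteState();
        final Set<Thread> currentCompactions = ConcurrentHashMap.newKeySet();
        final boolean[] wasInterrupted = {false};

        Thread compaction = new Thread(() -> {
            try {
                Thread.sleep(compactionMillis); // stands in for slow compaction IO
            } catch (InterruptedException e) {
                wasInterrupted[0] = true;
            } finally {
                synchronized (writestate) {
                    --writestate.compacting;
                    if (writestate.compacting <= 0) {
                        writestate.notifyAll();
                    }
                }
            }
        });
        synchronized (writestate) {
            ++writestate.compacting;
        }
        currentCompactions.add(compaction);
        compaction.start();

        long millis = waitMillis;
        try {
            synchronized (writestate) {
                while (writestate.compacting > 0) {
                    long start = System.currentTimeMillis();
                    writestate.wait(millis);
                    if (millis > 0 && System.currentTimeMillis() - start >= millis) {
                        // waited once without success: interrupt, then wait again
                        for (Thread t : currentCompactions) {
                            t.interrupt();
                        }
                        millis = 0;
                    }
                }
            }
            compaction.join(); // makes wasInterrupted[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return wasInterrupted[0];
    }
}
```

With a 5 second simulated compaction and a 100 ms wait, `closeWhileCompacting(5000, 100)` should report that the compaction was interrupted. With a 20 ms compaction and a 5 second wait, the compaction should finish on its own and the method should return false.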


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209924#comment-14209924
 ] 

Andrew Purtell commented on HBASE-12457:


+1 on the addendum for fixing test annotation import paths 

 Regions in transition for a long time when CLOSE interleaves with a slow 
 compaction
 ---

 Key: HBASE-12457
 URL: https://issues.apache.org/jira/browse/HBASE-12457
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.7
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, 
 12457-combined-trunk.txt, 12457-minifix.txt, 12457.interrupt-v2.txt, 
 12457.interrupt.txt, HBASE-12457.patch, HBASE-12457_addendum.patch



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209952#comment-14209952
 ] 

Andrew Purtell commented on HBASE-12457:


I pushed the addendum to branch-1 and master.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209985#comment-14209985
 ] 

Andrew Purtell commented on HBASE-12457:


I can see a TestRegionReplicas hang. We are getting hung up waiting for an 
HTable thread pool to terminate:
{noformat}
"Thread-2297" prio=10 tid=0x7feee0d1c800 nid=0x6173 waiting on condition [0x7fee508c6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00078e04d4c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
        at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1490)
        at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.afterClass(TestRegionReplicas.java:107)
        at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.restartRegionServer(TestRegionReplicas.java:220)
        at org.apache.hadoop.hbase.regionserver.TestRegionReplicas.testVerifySecondaryAbilityToReadWithOnFiles(TestRegionReplicas.java:421)
{noformat}

A worker thread in the HTable thread pool is hung up trying to get table state:

{noformat}
"htable-pool53-t2" daemon prio=10 tid=0x7feea454c000 nid=0x566e waiting on condition [0x7feec0365000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1487)
        - locked <0x00078cc03140> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1522)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1727)
        - locked <0x00078cc03140> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTableState(ConnectionManager.java:2504)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:894)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1064)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:289)
        at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:135)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:294)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:275)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

Not sure how this relates to any compaction changes. At first glance it doesn't 
seem to.
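
The shape of the two stacks above (one thread in `awaitTermination`, one pool worker asleep) can be reproduced with plain `java.util.concurrent`. The sketch below is not HTable or test code. `PoolShutdownSketch` and `terminatesInTime` are hypothetical names, and `Thread.sleep` is assumed as a stand-in for a worker sitting in a retry loop. It only shows that `shutdown()` does not interrupt a running task, so `awaitTermination` keeps waiting until the worker returns or the timeout expires.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolShutdownSketch {
    /** Returns true if the pool terminated within waitMillis after shutdown(). */
    public static boolean terminatesInTime(long workerSleepMillis, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.execute(() -> {
            try {
                Thread.sleep(workerSleepMillis); // stands in for a worker stuck retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pool.shutdown(); // rejects new tasks; the running task is not interrupted
        try {
            return pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt the worker so the sketch itself does not linger
        }
    }
}
```

Calling `terminatesInTime(5000, 100)` should return false because the worker is still sleeping when the wait expires, while `terminatesInTime(10, 5000)` should return true.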



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210020#comment-14210020
 ] 

Hudson commented on HBASE-12457:


FAILURE: Integrated in HBase-TRUNK #5773 (See 
[https://builds.apache.org/job/HBase-TRUNK/5773/])
Amend HBASE-12457 Regions in transition for a long time when CLOSE interleaves 
with a slow compaction; Test import fix (apurtell: rev 
f6d8cde1e4f67390a936e7bc9f8c70b65a808450)
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java


 Regions in transition for a long time when CLOSE interleaves with a slow 
 compaction
 ---

 Key: HBASE-12457
 URL: https://issues.apache.org/jira/browse/HBASE-12457
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.7
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, 
 12457-combined-trunk.txt, 12457-minifix.txt, 12457.interrupt-v2.txt, 
 12457.interrupt.txt, HBASE-12457.patch, HBASE-12457_addendum.patch, 
 TestRegionReplicas-jstack.txt



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210037#comment-14210037
 ] 

Andrew Purtell commented on HBASE-12457:


Well for whatever reason this change does trigger the above condition, due to 
some kind of timing change, because if I go back two commits, before this patch 
and the addendum, the test makes progress and completes.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210036#comment-14210036
 ] 

Lars Hofhansl commented on HBASE-12457:
---

Sorry about the build break on branch-1. I cherry-picked the patch. Usually I 
do a compile and run the relevant tests, but I spaced it this time.

The hang will not happen since we only notify *after* we set 
writestate.compacting (or writestate.flushing) back to false, so there is no 
race. I looked at that part :)
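
The argument about notifying only after the state change can be illustrated with a minimal sketch, assuming the usual Java monitor idiom: the waiter re-checks the state under the same monitor before every `wait()`, so a `notifyAll()` that happened earlier is not needed to wake it. `NoLostWakeupSketch`, `waitCalls` and the boolean `compacting` field are hypothetical simplifications and not the HBase code.

```java
public class NoLostWakeupSketch {
    // Hypothetical stand-in for the writestate monitor; a boolean keeps the sketch short.
    static final class WriteState {
        boolean compacting = true;
    }

    /**
     * Returns how many times the closer actually called wait().
     * If finishFirst is true, the compaction finishes (and notifies) before
     * the closer ever looks at the state.
     */
    public static int waitCalls(boolean finishFirst) {
        final WriteState writestate = new WriteState();
        Thread compaction = new Thread(() -> {
            try {
                if (!finishFirst) {
                    Thread.sleep(100); // give the closer time to start waiting
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            synchronized (writestate) {
                writestate.compacting = false; // change the state first ...
                writestate.notifyAll();        // ... then notify, under the same monitor
            }
        });
        compaction.start();

        int waits = 0;
        try {
            if (finishFirst) {
                compaction.join(); // notifyAll has already happened with nobody waiting
            }
            synchronized (writestate) {
                while (writestate.compacting) { // re-check the state before every wait
                    waits++;
                    writestate.wait();
                }
            }
            compaction.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return waits;
    }
}
```

When the notification has already been sent, `waitCalls(true)` returns 0 because the loop condition is false on the first check, so the closer never blocks. In the other ordering the closer blocks in `wait()` and is released by the later `notifyAll()`.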

In the face of the test failures I am going to roll this back anyway, though.



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210044#comment-14210044
 ] 

Lars Hofhansl commented on HBASE-12457:
---

reverted from all branches... sorry about the noise

 Regions in transition for a long time when CLOSE interleaves with a slow 
 compaction
 ---

 Key: HBASE-12457
 URL: https://issues.apache.org/jira/browse/HBASE-12457
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.7
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.9, 0.99.2

 Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, 
 12457-combined-trunk.txt, 12457-minifix.txt, 12457.interrupt-v2.txt, 
 12457.interrupt.txt, HBASE-12457.patch, HBASE-12457_addendum.patch, 
 TestRegionReplicas-jstack.txt



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210046#comment-14210046
 ] 

Lars Hofhansl commented on HBASE-12457:
---

[~apurtell], you mean the test condition, right? Or did you see it hanging 
specifically on that writestate.wait(...)?


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210054#comment-14210054
 ] 

Andrew Purtell commented on HBASE-12457:


I meant the minicluster shutdown sequencing issue.  Thanks for trying to get 
this in for .8 Lars.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210070#comment-14210070
 ] 

Hudson commented on HBASE-12457:


FAILURE: Integrated in HBase-1.0 #463 (See 
[https://builds.apache.org/job/HBase-1.0/463/])
Amend HBASE-12457 Regions in transition for a long time when CLOSE interleaves 
with a slow compaction; Test import fix (apurtell: rev 
9d2ad55cfa6108718d785b5e71ab10e9fb75a988)
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

2014-11-13 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14210073#comment-14210073
 ] 

stack commented on HBASE-12457:
---

Thanks for backing out the breaking change promptly.  Feel free to retry given you 
are watching the build results.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14210159#comment-14210159
 ] 

Hudson commented on HBASE-12457:


SUCCESS: Integrated in HBase-1.0 #464 (See 
[https://builds.apache.org/job/HBase-1.0/464/])
Revert Amend HBASE-12457 Regions in transition for a long time when CLOSE 
interleaves with a slow compaction; Test import fix (larsh: rev 
880c7c35fc50f28ec3e072a4c62a348fc964e9e0)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
Revert HBASE-12457 Regions in transition for a long time when CLOSE 
interleaves with a slow compaction. (larsh: rev 
1861f9ce25bc8609629928a670fdf3566486ca25)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14210160#comment-14210160
 ] 

Hudson commented on HBASE-12457:


FAILURE: Integrated in HBase-0.98 #675 (See 
[https://builds.apache.org/job/HBase-0.98/675/])
Revert HBASE-12457 Regions in transition for a long time when CLOSE 
interleaves with a slow compaction. (larsh: rev 
7f5f1570ce83c62ce9408701677994415b127b36)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14210184#comment-14210184
 ] 

Hudson commented on HBASE-12457:


SUCCESS: Integrated in HBase-TRUNK #5774 (See 
[https://builds.apache.org/job/HBase-TRUNK/5774/])
Revert Amend HBASE-12457 Regions in transition for a long time when CLOSE 
interleaves with a slow compaction; Test import fix (larsh: rev 
9d634772fa12e16b86b0218802b2e38cacdfd528)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
Revert HBASE-12457 Regions in transition for a long time when CLOSE 
interleaves with a slow compaction. (larsh: rev 
c29318c038f0f310562dc8194506b504eae72c1b)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211269#comment-14211269
 ] 

Hudson commented on HBASE-12457:


SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #643 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/643/])
Revert HBASE-12457 Regions in transition for a long time when CLOSE 
interleaves with a slow compaction. (larsh: rev 
7f5f1570ce83c62ce9408701677994415b127b36)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactionIO.java



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211809#comment-14211809
 ] 

Lars Hofhansl commented on HBASE-12457:
---

OK... What caused TestRegionReplicas to hang was the change that moved 
{{this.parent.writestate.writesEnabled = true;}} from SplitTransaction to 
HRegion.initializeRegionInternals.

That part is not needed anyway; it just looked like it would be more correct. 
Here's a patch for trunk that passes TestRegionReplicas.



[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

2014-11-13 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211906#comment-14211906
]

Hadoop QA commented on HBASE-12457:
---

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12681485/12457-trunk-v3.txt
against trunk revision .
ATTACHMENT ID: 12681485

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 2 new
or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.

{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1
warning messages.

{color:red}-1 checkstyle{color}. The applied patch generated
3787 checkstyle errors (more than the trunk's current 3786 errors).

{color:green}+1 findbugs{color}. The patch does not introduce any new
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.

{color:green}+1 lineLengths{color}. The patch does not introduce lines
longer than 100

{color:green}+1 site{color}. The mvn site goal succeeds with this patch.

{color:red}-1 core tests{color}. The patch failed these unit tests:

{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at
org.apache.hadoop.hbase.coprocessor.TestMasterObserver.testRegionTransitionOperations(TestMasterObserver.java:1488)

Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/checkstyle-aggregate.html

Javadoc warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//artifact/patchprocess/patchJavadocWarnings.txt
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/11673//console

This message is automatically generated.

Regions in transition for a long time when CLOSE interleaves with a slow
compaction
---

Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt,
12457-combined-trunk.txt, 12457-minifix.txt, 12457-trunk-v3.txt,
12457.interrupt-v2.txt, 12457.interrupt.txt, HBASE-12457.patch,
HBASE-12457_addendum.patch, TestRegionReplicas-jstack.txt

[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208293#comment-14208293
 ] 

Andrew Purtell commented on HBASE-12457:


I looked at the 0.98 combined patch. Changes lgtm, except:

- In HRegion#waitForFlushesAndCompactions, add debug level logging when issuing 
thread interrupts so we can determine if we are trying to interrupt but nothing 
subsequently happens.

- Use EnvironmentEdgeManager#currentTime (EEM#currentTimeMillis on 0.98) 
instead of System.currentTimeMillis
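
The two review points above can be illustrated with a minimal sketch. This is 
not the patch itself: the class name, the method, and the in-memory debug log 
are hypothetical, and a plain LongSupplier stands in for the injectable clock 
that EnvironmentEdgeManager provides, so that a test can control time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical sketch, not HBase code.
class InterruptLogSketch {
    // Stand-in for a debug-level logger (assumption for illustration only).
    static final List<String> debugLog = new ArrayList<>();

    /**
     * Interrupts the worker once waitMs has elapsed since startMs, as measured
     * by the injected clock, and records a debug line when it does so.
     * Returns true if an interrupt was issued.
     */
    static boolean interruptIfOverdue(Thread worker, long startMs, long waitMs,
                                      LongSupplier clock) {
        long elapsed = clock.getAsLong() - startMs;
        if (elapsed < waitMs) {
            return false; // still within the grace period, nothing to log
        }
        // Log before interrupting so the attempt is visible even if the
        // worker never reacts to it.
        debugLog.add("Interrupting " + worker.getName() + " after " + elapsed + " ms");
        worker.interrupt();
        return true;
    }
}
```

Passing the clock in, rather than calling System.currentTimeMillis directly, 
lets a test advance time without sleeping.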

Nice test.

 Regions in transition for a long time when CLOSE interleaves with a slow 
 compaction
 ---

 Key: HBASE-12457
 URL: https://issues.apache.org/jira/browse/HBASE-12457
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.7
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12457-combined-0.98.txt, 12457-minifix.txt, 
 12457.interrupt-v2.txt, 12457.interrupt.txt


 Under heavy load we have observed regions remaining in transition for 20 
 minutes when the master requests a close while a slow compaction is running.
 The pattern is always something like this:
 # RS starts a compaction
 # HM requests the region to be closed on this RS
 # Compaction is not aborted for another 20 minutes
 # The region is in transition and not usable.
 In every case I tracked down so far the time between the requested CLOSE and 
 abort of the compaction is almost exactly 20 minutes, which is suspicious.
 Of course part of the issue is having compactions that take over 20 minutes, 
 but maybe we can do better here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208502#comment-14208502
 ] 

Andrew Purtell commented on HBASE-12457:


30s is ok. We're trying to limit the time clients will see NSRE for a parent 
region going offline in a split transaction so we shouldn't be too conservative 
with waiting here. 

Under what circumstances would we not want to clean up files in tmp from a 
failed or aborted compaction? They're broken or redundant or both.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

[
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208494#comment-14208494
]

Lars Hofhansl commented on HBASE-12457:
---

bq. add debug level logging when issuing thread interrupts

Was *just* thinking that :) Will do. The new test also needs a license header.

You think 30s is good? Internally we found that all compactions that do not have
this issue are aborted within 8s. Could make it a minute, although it's not
really hurting anything. The only thing it doesn't do (when interrupting the
compaction) is clean up the files in tmp. Maybe it should do that...? (It might
be a bit hard to distinguish this from other exceptions for which we presumably
do not want to clean up the tmp files... Or do we?)


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208542#comment-14208542
 ] 

Lars Hofhansl commented on HBASE-12457:
---

Seeing an assertion failure now in the test... Checking.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208574#comment-14208574
 ] 

Lars Hofhansl commented on HBASE-12457:
---

Turns out that's because of HBASE-12454 (which I think was committed due to a 
misunderstanding).


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208637#comment-14208637
 ] 

Andrew Purtell commented on HBASE-12457:


I pushed reverts for HBASE-12454 so that should be good now. 


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

[
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208708#comment-14208708
]

Lars Hofhansl commented on HBASE-12457:
---

We've seen this on two machines now. The wait on the other machine was also
close to 20m (18m to be precise).

Last question: Is a 30s wait time before we interrupt good enough? The
compactions should cancel themselves (in our case we find that unless they hang
in the described way, they cancel themselves after no more than 8s). Could
maybe wait a minute too. Not sure.
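
The wait-then-interrupt idea discussed here can be sketched roughly as follows. 
This is a simplified illustration under stated assumptions, not the attached 
patch: the class, the enum, and the method are hypothetical, and an 
AtomicBoolean stands in for the writesEnabled flag that a compaction checks to 
cancel itself.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not HBase code.
class CloseWaitSketch {
    /** Outcome of waiting for a background task during close. */
    enum Outcome { FINISHED_ON_ITS_OWN, INTERRUPTED, CALLER_INTERRUPTED }

    /**
     * Clears the shared flag so the task can cancel itself, waits up to
     * graceMs for the worker to exit, and interrupts it only if it is still
     * alive afterwards. graceMs must be positive: join(0) waits forever.
     */
    static Outcome stopWorker(Thread worker, AtomicBoolean writesEnabled, long graceMs) {
        writesEnabled.set(false);          // cooperative signal checked by the task
        try {
            worker.join(graceMs);          // bounded wait for self-cancellation
            if (!worker.isAlive()) {
                return Outcome.FINISHED_ON_ITS_OWN;
            }
            worker.interrupt();            // fallback for a task that did not react
            worker.join();                 // assumes the task exits once interrupted
            return Outcome.INTERRUPTED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return Outcome.CALLER_INTERRUPTED;
        }
    }
}
```

A task that notices the flag within the grace period is never interrupted, so 
only tasks that overrun the grace period see a change in behavior.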

Regions in transition for a long time when CLOSE interleaves with a slow
compaction
---

Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt,
12457-minifix.txt, 12457.interrupt-v2.txt, 12457.interrupt.txt

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208769#comment-14208769
 ] 

Andrew Purtell commented on HBASE-12457:


v2 patch lgtm

 Regions in transition for a long time when CLOSE interleaves with a slow 
 compaction
 ---

 Key: HBASE-12457
 URL: https://issues.apache.org/jira/browse/HBASE-12457
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.7
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12457-combined-0.98-v2.txt, 12457-combined-0.98.txt, 
 12457-minifix.txt, 12457.interrupt-v2.txt, 12457.interrupt.txt


 Under heavy load we have observed regions remaining in transition for 20 
 minutes when the master requests a close while a slow compaction is running.
 The pattern is always something like this:
 # RS starts a compaction
 # HM requests the region to be closed on this RS
 # Compaction is not aborted for another 20 minutes
 # The region is in transition and not usable.
 In every case I tracked down so far the time between the requested CLOSE and 
 abort of the compaction is almost exactly 20 minutes, which is suspicious.
 Of course part of the issue is having compactions that take over 20 minutes, 
 but maybe we can do better here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction


[ 
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209142#comment-14209142
 ] 

Andrew Purtell commented on HBASE-12457:


Let's commit this if you feel comfortable with it [~lhofhansl] so the next 
0.98.8 RC can get out the door. (Or we can try again later for .9)


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction

[
https://issues.apache.org/jira/browse/HBASE-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209271#comment-14209271
]

Lars Hofhansl commented on HBASE-12457:
---

Lemme make a quick trunk patch. I think this is quite safe. If compactions
manage to cancel themselves within 30s, it is functionally unchanged. And the
change in SplitTransaction seems correct to me as well.


[jira] [Commented] (HBASE-12457) Regions in transition for a long time when CLOSE interleaves with a slow compaction