[jira] [Commented] (HBASE-14495) TestHRegion#testFlushCacheWhileScanning goes zombie

stack (JIRA) Mon, 28 Sep 2015 22:06:49 -0700

    [ https://issues.apache.org/jira/browse/HBASE-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934624#comment-14934624 ]


stack commented on HBASE-14495:
-------------------------------

findHangingTests.py says:

kalashnikov:hbase.git.commit stack$ python ./dev-support/findHangingTests.py https://builds.apache.org/job/PreCommit-HBASE-Build/15790//consoleText
Fetching the console output from the URL
Printing hanging tests
Hanging test : org.apache.hadoop.hbase.regionserver.TestHRegion
Hanging test : org.apache.hadoop.hbase.master.TestDistributedLogSplitting
Printing Failing tests
Failing test : org.apache.hadoop.hbase.http.TestHttpServerLifecycle
Failing test : org.apache.hadoop.hbase.client.TestSnapshotCloneIndependence
Failing test : org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS
Failing test : org.apache.hadoop.hbase.client.TestMobSnapshotCloneIndependence
Failing test : org.apache.hadoop.hbase.coprocessor.TestWALObserver
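
For anyone reading the output above without the script to hand, here is a rough sketch of the idea. It is not the actual findHangingTests.py. It assumes Surefire-style console lines ("Running <class>" when a test class starts, "Tests run: ... - in <class>" when it finishes), and the function name and regexes are made up for illustration. A class that starts but never reports a result is treated as hanging.

```python
import re

# Assumed Surefire-style console markers; the real console text may differ.
RUNNING = re.compile(r"^Running (\S+)$")
FINISHED = re.compile(r"^Tests run:.* - in (\S+)$")


def classify_tests(console_text):
    """Return (hanging, failed) test class names found in console_text.

    hanging: classes with a 'Running' line but no matching result line.
    failed:  classes whose result line is flagged FAILURE! or ERROR!.
    """
    started, finished, failed = set(), set(), set()
    for raw in console_text.splitlines():
        line = raw.strip()
        match = RUNNING.match(line)
        if match:
            started.add(match.group(1))
            continue
        match = FINISHED.match(line)
        if match:
            finished.add(match.group(1))
            if "FAILURE!" in line or "ERROR!" in line:
                failed.add(match.group(1))
    # Started but never finished: candidate zombies.
    return sorted(started - finished), sorted(failed)
```

A test class that is still running when the console is fetched would also show up as hanging under this scheme, so the result is a hint rather than proof.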

So, the TestHttpServerLifecycle is dodgy. Here it did this:

Flaked tests: 
org.apache.hadoop.hbase.http.TestHttpServerLifecycle.testStartedServerIsAlive(org.apache.hadoop.hbase.http.TestHttpServerLifecycle)
  Run 1: TestHttpServerLifecycle.testStartedServerIsAlive:73->HttpServerFunctionalTest.stop:195 » TestTimedOut
  Run 2: PASS

I went on to disable this test altogether. See related HBASE-14430.

The TestWALObserver failure was legit. In this patch I was letting empty WAL 
edits go through to CPs to bump counters and metrics, but it looks like we 
don't want that... or at least someone went to the trouble of testing that we 
don't do that... so I removed this change from the patch.

The TestHRegion hangs are another instance of a mocked WAL, but with mockito 
this time -- we were not releasing the latch inside the mocked WAL. Fixed.
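
To make the failure mode concrete, here is a minimal sketch of how a mock that skips the latch leaves the caller waiting. It is an analogy in Python rather than the HBase/mockito code, and every name in it (WriteEntry, BrokenMockWal, FixedMockWal, flush) is hypothetical. The assumption is that the real append releases a latch once the edit has been stamped, and that the flush/close path waits on that latch.

```python
import threading


class WriteEntry:
    """Hypothetical stand-in for a WAL entry guarded by a latch."""

    def __init__(self):
        self._latch = threading.Event()
        self.sequence_id = None

    def stamp(self, sequence_id):
        self.sequence_id = sequence_id
        self._latch.set()  # the step a careless mock leaves out

    def wait_for_sequence_id(self, timeout):
        # Returns False on timeout; with no timeout this blocks forever,
        # which is what a zombie test looks like from the outside.
        return self._latch.wait(timeout)


class BrokenMockWal:
    def append(self, entry):
        pass  # swallows the edit and never releases the latch


class FixedMockWal:
    def append(self, entry):
        entry.stamp(1)  # mock still does nothing real, but releases the latch


def flush(wal, timeout=0.2):
    """Append an entry, then wait for it the way a flush/close path would."""
    entry = WriteEntry()
    wal.append(entry)
    return entry.wait_for_sequence_id(timeout)
```

With these stand-ins, flush(BrokenMockWal()) gives up after the timeout and returns False, while flush(FixedMockWal()) returns True straight away. The timeout is only there so the sketch terminates; the hang in the test had no such escape hatch.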

TestDistributedLogSplitting did not get the HBASE-14378 treatment. We hang in 
master.TestDistributedLogSplitting#testWorkerAbort.  Disabling. Made issue to 
reenable.

TestWALProcedureStoreOnHDFS is a well-known failure to be addressed elsewhere. 
Will look at the other failures too in other issues. They just seem to be 
flaky... they pass locally.

> TestHRegion#testFlushCacheWhileScanning goes zombie
> ---------------------------------------------------
>
>                 Key: HBASE-14495
>                 URL: https://issues.apache.org/jira/browse/HBASE-14495
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test
>            Reporter: stack
>            Assignee: stack
>         Attachments: 14495.txt, 14495.txt, 14495v3.txt, 14495v6.txt, 14495v7.txt
>
>
> This test goes zombie on us, most recently, here: 
> https://builds.apache.org/job/PreCommit-HBASE-Build/15744//console
> It does not fail on my internal rig runs nor locally on laptop when run in a 
> loop.
> It's hung up in the close of the region:
> {code}
> "main" prio=10 tid=0x00007fc49800a800 nid=0x6053 in Object.wait() [0x00007fc4a02c9000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000007d07c3478> (a java.lang.Object)
>       at org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.waitForRead(MultiVersionConcurrencyControl.java:207)
>       - locked <0x00000007d07c3478> (a java.lang.Object)
>       at org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.completeAndWait(MultiVersionConcurrencyControl.java:143)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2257)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2061)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2026)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2016)
>       at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1423)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1344)
>       - locked <0x00000007d07c34a8> (a java.lang.Object)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1295)
>       at org.apache.hadoop.hbase.HBaseTestingUtility.closeRegionAndWAL(HBaseTestingUtility.java:352)
>       at org.apache.hadoop.hbase.regionserver.TestHRegion.testFlushCacheWhileScanning(TestHRegion.java:3756)
> {code}
> It is waiting on mvcc to catch up.
> There is this comment at the point where we are hung:
>             // TODO: Lets see if we hang here, if there is a scenario where an outstanding reader
>             // with a read point is in advance of this write point.
>             mvcc.completeAndWait(writeEntry);
> The above came in with HBASE-12751. The comment was added at v29:
> https://issues.apache.org/jira/secure/attachment/12754775/12751.rebased.v29.txt
> Looks like I added it, so I must have had a suspicion that this might be 
> dodgy... Let me take a look... 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
