[ https://issues.apache.org/jira/browse/HBASE-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351640#comment-16351640 ]

Duo Zhang commented on HBASE-19927:
-----------------------------------

I waited for the region server to crash before reading and tried several 
times; all runs passed.

And I think the problem is that we enter the normal shutdown first, schedule 
the region closes, and then wait for all the regions to be closed. During the 
closing, the master moves the WAL directory, so we fail when writing the flush 
marker, and AsyncFSWAL triggers a log roll. The LogRoller then gets an 
exception when creating the new log file and calls rs.abort.
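
To make the chain concrete, here is a minimal sketch of the roll-then-abort 
step. SketchLogRoller, WalHandle and Server are hypothetical stand-ins for 
illustration, not the real HBase classes:

{code:java}
// Illustrative sketch only; SketchLogRoller, WalHandle and Server are
// hypothetical stand-ins, not the actual HBase code.
import java.io.IOException;

interface WalHandle {
  // Rolling creates a new log file under the WAL directory; if the master
  // has already moved that directory, the creation fails.
  void rollWriter() throws IOException;
}

interface Server {
  void abort(String reason, Throwable cause);
}

class SketchLogRoller implements Runnable {
  private final WalHandle wal;
  private final Server server;

  SketchLogRoller(WalHandle wal, Server server) {
    this.wal = wal;
    this.server = server;
  }

  @Override
  public void run() {
    try {
      wal.rollWriter();
    } catch (IOException e) {
      // The step described above: creating the new log file fails because
      // the WAL directory is gone, so the roller aborts the region server.
      server.abort("Failed log roll", e);
    }
  }
}
{code}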

I've already done a hack in AsyncFSWAL: when we get 'Parent directory 
doesn't exist:' we fail all the pending requests. But here it is not enough. 
We can only fail some of the flush marker writes; if there are still flush 
requests that write a flush marker after the log roll fails, they will be 
stuck forever, because we will not schedule a consumer to write them out, so 
we will not trigger another log roll and cannot trigger the hack again...
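
Roughly, the hack behaves like the sketch below. SketchAsyncWal and its 
members are hypothetical; the real change lives in AsyncFSWAL's sync path:

{code:java}
// Hypothetical sketch of the hack: on the 'Parent directory doesn't exist:'
// error, fail every already-queued sync future so blocked flushes wake up.
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

class SketchAsyncWal {
  // Sync requests already queued and waiting for their edits to be synced.
  private final Deque<CompletableFuture<Void>> pendingSyncs = new ArrayDeque<>();

  void onWriteError(IOException e) {
    String msg = e.getMessage();
    if (msg != null && msg.contains("Parent directory doesn't exist:")) {
      // Fail everything already queued, so flushes blocked in sync()
      // unblock with an exception instead of waiting forever. But a flush
      // marker appended *after* this point never gets a consumer
      // scheduled, so no new roll happens and the hack never re-fires.
      CompletableFuture<Void> f;
      while ((f = pendingSyncs.poll()) != null) {
        f.completeExceptionally(e);
      }
    }
  }
}
{code}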

A possible solution may be that, when the log roller finds that it cannot 
roll a WAL, it shuts down the WAL directly before aborting the RS...
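
A minimal sketch of that idea, again with hypothetical names (Wal, 
SketchRoller) and assuming shutdown() makes pending and future syncs fail 
fast; this is the proposal, not a committed change:

{code:java}
// Hypothetical sketch of the proposed fix: if the roll itself fails, shut
// the WAL down first so all syncs fail fast, then abort the region server.
import java.io.IOException;

interface Wal {
  void rollWriter() throws IOException;
  // Shutting down marks the WAL as closed; any thread blocked in sync(),
  // or appending a flush marker later, fails immediately instead of
  // waiting on a signal that will never come.
  void shutdown() throws IOException;
}

class SketchRoller {
  private final Wal wal;
  private final Runnable abortRs;

  SketchRoller(Wal wal, Runnable abortRs) {
    this.wal = wal;
    this.abortRs = abortRs;
  }

  void roll() {
    try {
      wal.rollWriter();
    } catch (IOException rollFailure) {
      try {
        wal.shutdown(); // fail fast everything blocked on this WAL
      } catch (IOException ignored) {
        // best effort; we are aborting anyway
      }
      abortRs.run();
    }
  }
}
{code}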

> TestFullLogReconstruction flakey
> --------------------------------
>
>                 Key: HBASE-19927
>                 URL: https://issues.apache.org/jira/browse/HBASE-19927
>             Project: HBase
>          Issue Type: Sub-task
>          Components: wal
>            Reporter: stack
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: HBASE-19927.patch, js, out
>
>
> Fails pretty frequently in hadoopqa builds.
> There is a recent hang in 
> org.apache.hadoop.hbase.TestFullLogReconstruction.tearDownAfterClass(TestFullLogReconstruction.java:68)
> In here... 
> https://builds.apache.org/job/PreCommit-HBASE-Build/11363/testReport/org.apache.hadoop.hbase/TestFullLogReconstruction/org_apache_hadoop_hbase_TestFullLogReconstruction/
> ... see here.
> Thread 1250 (RS_CLOSE_META-edd281aedb18:59863-0):
>   State: TIMED_WAITING
>   Blocked count: 92
>   Waited count: 278
>   Stack:
>     java.lang.Object.wait(Native Method)
>     org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:133)
>     org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
>     org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:605)
>     org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:154)
>     org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:81)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2356)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2328)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2319)
>     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1531)
>     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
>     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
>     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     java.lang.Thread.run(Thread.java:748)
> We missed a signal? Do we need to do an interrupt? The log is not all there 
> in hadoopqa builds, so it is hard to see all that is going on. This test is 
> not in the flakey set either....



