[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

Duo Zhang (JIRA) Wed, 10 Jul 2019 08:15:22 -0700


    [ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882178#comment-16882178 ]


Duo Zhang commented on HBASE-22665:
-----------------------------------

Oh, the waitingRoll state is only modified while holding the consumeLock, so 
there is no race...

Let's assume the writer is broken. There are two possible orderings:
If syncFailed is executed first, it marks the writer as broken, so 
waitForSafePoint returns immediately at its writerBroken check.
If waitForSafePoint is executed first, it sets the waitingRoll state in 
epochAndState, so syncFailed wakes up the threads waiting on 
readyForRollingCond.
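
To make the two orderings concrete, here is a minimal Java sketch. It is a 
simplified model and not the actual WAL code: the class SafePointSketch, the 
plain boolean fields standing in for the bits of epochAndState, the 
readyForRolling flag, and the helpers isWaitingRoll and 
rollerWokenBySyncFailed are all assumptions made for illustration. Only the 
names consumeLock, readyForRollingCond, writerBroken, waitingRoll, syncFailed 
and waitForSafePoint come from the discussion above.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the interaction described above (hypothetical class).
class SafePointSketch {
    private final ReentrantLock consumeLock = new ReentrantLock();
    private final Condition readyForRollingCond = consumeLock.newCondition();

    // Assumption: plain booleans stand in for the state bits of epochAndState.
    private boolean writerBroken = false;
    private boolean waitingRoll = false;
    private boolean readyForRolling = false;

    // Roller side. Returns false if it bailed out at the writerBroken check,
    // true if it waited and was woken up.
    boolean waitForSafePoint() {
        consumeLock.lock();
        try {
            if (writerBroken) {
                return false; // syncFailed ran first, nothing to wait for
            }
            waitingRoll = true;
            readyForRolling = false;
            while (!readyForRolling) {
                readyForRollingCond.awaitUninterruptibly();
            }
            return true;
        } finally {
            consumeLock.unlock();
        }
    }

    // Sync failure callback. Both state changes happen under consumeLock,
    // so they cannot interleave with the checks in waitForSafePoint.
    void syncFailed() {
        consumeLock.lock();
        try {
            writerBroken = true;
            if (waitingRoll) {
                readyForRolling = true;
                readyForRollingCond.signalAll();
            }
        } finally {
            consumeLock.unlock();
        }
    }

    // Hypothetical helper: reads waitingRoll under the lock.
    boolean isWaitingRoll() {
        consumeLock.lock();
        try {
            return waitingRoll;
        } finally {
            consumeLock.unlock();
        }
    }

    // Hypothetical helper: runs the "waitForSafePoint first" ordering and
    // reports whether the waiting thread finished after syncFailed.
    static boolean rollerWokenBySyncFailed() {
        SafePointSketch wal = new SafePointSketch();
        Thread roller = new Thread(wal::waitForSafePoint);
        roller.setDaemon(true);
        roller.start();
        while (!wal.isWaitingRoll()) {
            Thread.yield(); // wait until the roller has set waitingRoll
        }
        wal.syncFailed();
        try {
            roller.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !roller.isAlive();
    }

    public static void main(String[] args) {
        SafePointSketch wal = new SafePointSketch();
        wal.syncFailed();
        System.out.println("syncFailed first: waited=" + wal.waitForSafePoint());
        System.out.println("waitForSafePoint first: woken="
            + rollerWokenBySyncFailed());
    }
}
```

In this model, whichever method takes consumeLock first leaves behind the 
state the other one checks, so the waiter is never left parked on 
readyForRollingCond with a broken writer.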



> RegionServer abort failed when AbstractFSWAL.shutdown hang
> ----------------------------------------------------------
>
>                 Key: HBASE-22665
>                 URL: https://issues.apache.org/jira/browse/HBASE-22665
>             Project: HBase
>          Issue Type: Bug
>         Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>            Reporter: Yechao Chen
>            Priority: Major
>         Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use HBase 2.1.2. When the RegionServer is under heavy QPS, it aborts 
> with an error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 300000 ms for txid=36380334, WAL system stuck?"
>  
> The RegionServer abort then fails because AbstractFSWAL.shutdown hangs.
>  
> jstack always shows the RegionServer thread hanging in 
> "AbstractFSWAL.shutdown":
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 tid=0x00007f204282c600 nid=0x34aa waiting on condition [0x00007f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x00007f18a49b2bb8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
>  at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
