[
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882178#comment-16882178
]
Duo Zhang commented on HBASE-22665:
-----------------------------------
Oh, the waitingRoll state is modified inside the consumeLock so there is no
race...
Let's assume the writer is broken.
If syncFailed is executed first, then it will mark the writer as broken, so in
the waitForSafePoint method, it will return at the writerBroken check.
If waitForSafePoint is executed first, then it will set the epochAndState to
waitingRoll, so in the syncFailed method, we will wake up the threads waiting
on readyForRollingCond.
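
To make the two orderings concrete, here is a minimal sketch of the locking pattern described above. It is a simplified model, not the HBase code: the class name SafePointSketch, the three boolean flags (standing in for the packed epochAndState field) and the finishes helper are assumptions made for illustration. Only the names consumeLock, readyForRollingCond, syncFailed and waitForSafePoint come from the discussion.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of the two code paths discussed above.
// It mirrors only the locking pattern, not the actual HBase implementation.
class SafePointSketch {
  private final ReentrantLock consumeLock = new ReentrantLock();
  private final Condition readyForRollingCond = consumeLock.newCondition();

  // Assumption: plain booleans stand in for the packed epochAndState field.
  // They are read and written only while holding consumeLock.
  private boolean waitingRoll = false;
  private boolean writerBroken = false;
  private boolean readyForRolling = false;

  void syncFailed() {
    consumeLock.lock();
    try {
      if (writerBroken) {
        return;
      }
      writerBroken = true;
      // If a roller is already parked, wake it up so it cannot wait forever.
      if (waitingRoll) {
        readyForRolling = true;
        readyForRollingCond.signalAll();
      }
    } finally {
      consumeLock.unlock();
    }
  }

  void waitForSafePoint() {
    consumeLock.lock();
    try {
      // If syncFailed ran first, bail out here instead of waiting.
      if (writerBroken) {
        return;
      }
      waitingRoll = true;
      readyForRolling = false;
      while (!readyForRolling) {
        readyForRollingCond.awaitUninterruptibly();
      }
    } finally {
      consumeLock.unlock();
    }
  }

  boolean isWaitingRoll() {
    consumeLock.lock();
    try {
      return waitingRoll;
    } finally {
      consumeLock.unlock();
    }
  }

  // Runs one ordering and reports whether waitForSafePoint returned within 5 s.
  static boolean finishes(boolean syncFailedFirst) {
    SafePointSketch wal = new SafePointSketch();
    if (syncFailedFirst) {
      wal.syncFailed();
    }
    Thread waiter = new Thread(wal::waitForSafePoint);
    waiter.setDaemon(true);
    waiter.start();
    try {
      if (!syncFailedFirst) {
        // Wait until the waiter has published waitingRoll, then fail the sync.
        while (!wal.isWaitingRoll()) {
          Thread.sleep(1);
        }
        wal.syncFailed();
      }
      waiter.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !waiter.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("syncFailed first:       " + finishes(true));
    System.out.println("waitForSafePoint first: " + finishes(false));
  }
}
```

Because both methods check and update the flags under the same lock, whichever one acquires consumeLock second always sees the other's update, so in this model neither ordering leaves the waiter parked.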
> RegionServer abort failed when AbstractFSWAL.shutdown hang
> ----------------------------------------------------------
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
> Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
> Reporter: Yechao Chen
> Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png,
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png,
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use HBase 2.1.2. When a RegionServer is under heavy QPS, it aborts with an
> error like
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to
> get sync result after 300000 ms for txid=36380334, WAL system stuck?"
>
> The RegionServer abort then fails to complete because AbstractFSWAL.shutdown
> hangs.
>
> jstack output always shows the RegionServer thread stuck in
> "AbstractFSWAL.shutdown":
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 tid=0x00007f204282c600 nid=0x34aa waiting on condition [0x00007f0fe044d000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00007f18a49b2bb8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>     at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
>     at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>     at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>     at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
>     at java.lang.Thread.run(Thread.java:745)
>