[ https://issues.apache.org/jira/browse/HBASE-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaolin Ha updated HBASE-25905:
-------------------------------
Description:
We use the fan-out HDFS OutputStream and AsyncFSWAL on our clusters, but encountered a problem where an RS could not exit completely for several hours until manual intervention.
We found that the incomplete RS exit is related to the WAL stuck problems mentioned in HBASE-25631. When blockOnSync times out while flushing the cache, the RS will kill itself, but only a broken stream can make sync fail and break the WAL. Since there is no socket timeout in the fan-out OutputStream, the output flush can stay stuck for an unlimited time.
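The root of the unbounded wait is {{Condition#awaitUninterruptibly()}}, which has no deadline. A minimal standalone sketch (not HBase code; the condition name is hypothetical) showing how a timed {{await}} would return instead of parking forever:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class BoundedWaitDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        Condition readyForRoll = lock.newCondition(); // never signalled here

        lock.lock();
        try {
            // awaitUninterruptibly() would park this thread forever, which is
            // exactly what the jstacks show in AsyncFSWAL#waitForSafePoint().
            // A timed await returns false once the deadline passes:
            boolean signalled = readyForRoll.await(100, TimeUnit.MILLISECONDS);
            System.out.println("signalled=" + signalled); // prints signalled=false
        } finally {
            lock.unlock();
        }
    }
}
```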
The two jstacks below show that the regionserver thread can wait indefinitely on both
AsyncFSWAL#waitForSafePoint()
{code:java}
"regionserver/gh-data-hbase-prophet25.gh.sankuai.com/10.22.129.16:16020" #28 prio=5 os_prio=0 tid=0x00007fd7f4716000 nid=0x83a2 waiting on condition [0x00007fd7ec687000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007fddc4a5ac68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:709)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:793)
	at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
	at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
	at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
	at java.lang.Thread.run(Thread.java:745)
{code}
and AbstractFSWAL.rollWriterLock (when the logRoller is stuck in AsyncFSWAL#waitForSafePoint())
{code:java}
"regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020" #28 prio=5 os_prio=0 tid=0x00007fafc68e9000 nid=0x2091 waiting on condition [0x00007f9d661ef000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007fa03d52c298> (a java.util.concurrent.locks.ReentrantLock$FairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:791)
	at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
	at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
	at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
	at java.lang.Thread.run(Thread.java:745)
...
"regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020.logRoller" #253 daemon prio=5 os_prio=0 tid=0x00007fafa90a1000 nid=0x20a4 waiting on condition [0x00007f9d649ba000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007fa03d2f18f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:681)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:685)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:746)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:188)
	at java.lang.Thread.run(Thread.java:745){code}
WAL stuck error logs:
!wal-stuck-error-logs.png|width=924,height=225!
HBASE-14790 noted that choosing a proper timeout is a hard problem. But we can still limit the shutdown time of the WAL, which is also reasonable for MTTR.
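One way the shutdown time could be bounded is to run the shutdown call under a deadline and abort if it does not finish. This is only a sketch of the idea, not the actual patch; the method and timeout names are hypothetical:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedShutdown {
    // Hypothetical helper: bound an otherwise unbounded shutdown call.
    static void shutdownWithTimeout(Runnable shutdownTask, long timeoutMs)
            throws TimeoutException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = pool.submit(shutdownTask);
            f.get(timeoutMs, TimeUnit.MILLISECONDS); // throws TimeoutException if stuck
        } finally {
            pool.shutdownNow(); // interrupt the stuck worker so the JVM can exit
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a WAL shutdown that hangs forever, like the stuck fan-out flush.
        Runnable stuckShutdown = () -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
        };
        try {
            shutdownWithTimeout(stuckShutdown, 200);
            System.out.println("clean shutdown");
        } catch (TimeoutException e) {
            System.out.println("WAL shutdown timed out, proceeding with abort");
        }
    }
}
```

With such a bound, the RS exit path in HRegionServer#shutdownWAL would give up after the configured deadline instead of parking for hours, improving MTTR at the cost of possibly skipping a clean close of the stuck writer.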
> Limit the shutdown time of WAL
> ------------------------------
>
> Key: HBASE-25905
> URL: https://issues.apache.org/jira/browse/HBASE-25905
> Project: HBase
> Issue Type: Improvement
> Components: regionserver, wal
> Affects Versions: 3.0.0-alpha-1, 2.0.0
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Attachments: rs-jstack1, rs-jstack2, wal-stuck-error-logs.png
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)