[ 
https://issues.apache.org/jira/browse/HBASE-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-25905:
-------------------------------
    Description: 
We use the fan-out HDFS OutputStream and AsyncFSWAL on our clusters, but hit a problem where the RS could not exit completely for several hours until manual intervention.

We found that the RS failing to exit completely is related to WAL stuck problems, as mentioned in HBASE-25631. When blockOnSync times out while flushing the cache, the RS will kill itself, but only a broken stream can make sync fail and break the WAL. There is no socket timeout in the fan-out OutputStream, so the output flush can stall for an unlimited time.
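The failure mode is easy to reproduce in miniature: a sync future that never completes blocks the caller forever unless a deadline is imposed. A minimal sketch (the {{syncWithDeadline}} helper and the timeout value are hypothetical illustrations, not HBase API):

{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedSync {
    // Hypothetical helper: convert an unbounded wait into a bounded one.
    // Without the deadline, get() parks forever if the stream never acks.
    static void syncWithDeadline(CompletableFuture<Long> syncFuture,
                                 long timeoutMs) throws Exception {
        try {
            syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IOException("WAL sync timed out after " + timeoutMs + " ms", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a flush that never completes, like a hung fan-out stream.
        CompletableFuture<Long> neverCompletes = new CompletableFuture<>();
        try {
            syncWithDeadline(neverCompletes, 100);
        } catch (IOException e) {
            System.out.println("timed out"); // prints "timed out"
        }
    }
}
{code}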

The two jstacks below show that the regionserver thread can wait indefinitely in both

AsyncFSWAL#waitForSafePoint()
{code:java}
"regionserver/gh-data-hbase-prophet25.gh.sankuai.com/10.22.129.16:16020" #28 
prio=5 os_prio=0 tid=0x00007fd7f4716000 nid=0x83a2 waiting on condition 
[0x00007fd7ec687000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007fddc4a5ac68> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:709)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:793)
        at 
org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
        at 
org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
        at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
        at java.lang.Thread.run(Thread.java:745)
{code}
and AbstractFSWAL.rollWriterLock (when the LogRoller is stuck in AsyncFSWAL#waitForSafePoint())
{code:java}
"regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020" #28 
prio=5 os_prio=0 tid=0x00007fafc68e9000 nid=0x2091 waiting on condition 
[0x00007f9d661ef000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007fa03d52c298> (a 
java.util.concurrent.locks.ReentrantLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at 
java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:791)
        at 
org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
        at 
org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
        at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
        at java.lang.Thread.run(Thread.java:745)
...
"regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020.logRoller"
 #253 daemon prio=5 os_prio=0 tid=0x00007fafa90a1000 nid=0x20a4 waiting on 
condition [0x00007f9d649ba000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007fa03d2f18f8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:681)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:685)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:746)
        at 
org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:188)
        at java.lang.Thread.run(Thread.java:745){code}
WAL stuck error logs:

!wal-stuck-error-logs.png|width=924,height=225!

HBASE-14790 noted that choosing a timeout value is a hard problem. But we can limit the shutdown time of the WAL, which is also reasonable for MTTR.
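One hedged sketch of such a limit: run the WAL shutdown on a separate thread and bound the wait with Future#get(timeout), abandoning the shutdown if it does not finish in time so the RS can still exit. The helper name and the timeout are illustrative assumptions, not the actual HBase patch:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.LockSupport;

public class BoundedWalShutdown {
    // Hypothetical sketch: bound WAL shutdown so a stuck
    // waitForSafePoint() cannot block RS exit indefinitely.
    static boolean shutdownWithTimeout(Runnable walShutdown, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> f = pool.submit(walShutdown);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                 // shutdown finished in time
        } catch (TimeoutException e) {
            f.cancel(true);              // give up on the WAL; let the RS exit
            return false;
        } catch (Exception e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulate a shutdown that parks forever, like the jstacks above.
        boolean clean = shutdownWithTimeout(LockSupport::park, 200);
        System.out.println(clean ? "clean" : "timed out"); // prints "timed out"
    }
}
{code}

The trade-off is that an abandoned shutdown may skip the final WAL close, but the WAL is already unusable in this scenario and lease recovery handles the file on the NameNode side.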

 



> Limit the shutdown time of WAL
> ------------------------------
>
>                 Key: HBASE-25905
>                 URL: https://issues.apache.org/jira/browse/HBASE-25905
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, wal
>    Affects Versions: 3.0.0-alpha-1, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: rs-jstack1, rs-jstack2, wal-stuck-error-logs.png
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
