[ 
https://issues.apache.org/jira/browse/HBASE-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaolin Ha updated HBASE-25905:
-------------------------------
    Summary: Shutdown of WAL stuck at waitForSafePoint  (was: Shutdown of WAL 
stuck at )

> Shutdown of WAL stuck at waitForSafePoint
> -----------------------------------------
>
>                 Key: HBASE-25905
>                 URL: https://issues.apache.org/jira/browse/HBASE-25905
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 3.0.0-alpha-1, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Critical
>         Attachments: rs-jstack1, rs-jstack2, wal-stuck-error-logs.png
>
>
> We use the fan-out HDFS OutputStream and AsyncFSWAL on our clusters, but we hit a 
> problem where the RS could not exit completely for several hours until manual 
> intervention.
> We found that the RS failing to exit completely is related to WAL stuck problems, 
> as mentioned in HBASE-25631. When blockOnSync times out while flushing the cache, 
> the RS will kill itself, but only a broken stream can make sync fail and break 
> the WAL; since there is no socket timeout in the fan-out OutputStream, the output 
> flush can be stuck for an unlimited time.
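> The shutdown hangs because waitForSafePoint() parks on a condition with no upper 
> bound (awaitUninterruptibly in the jstacks below), so if the consumer never 
> reaches the safe point, nothing ever wakes the waiter. A simplified sketch of the 
> unbounded wait versus a timed wait (an illustration only, not the actual 
> AsyncFSWAL code):
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.Condition;
> import java.util.concurrent.locks.ReentrantLock;
>
> public class SafePointWaitSketch {
>   private final ReentrantLock lock = new ReentrantLock();
>   private final Condition readyForRolling = lock.newCondition();
>   private volatile boolean safePointReached = false;
>
>   // Unbounded wait: if the signal never comes (e.g. the fan-out stream hangs
>   // with no socket timeout), this thread parks forever.
>   void waitForSafePointUnbounded() {
>     lock.lock();
>     try {
>       while (!safePointReached) {
>         readyForRolling.awaitUninterruptibly();
>       }
>     } finally {
>       lock.unlock();
>     }
>   }
>
>   // Bounded wait: gives up after a deadline so shutdown can make progress.
>   boolean waitForSafePointBounded(long timeoutMs) throws InterruptedException {
>     long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
>     lock.lock();
>     try {
>       while (!safePointReached) {
>         if (remainingNanos <= 0) {
>           return false; // timed out; the caller can abort or close forcibly
>         }
>         remainingNanos = readyForRolling.awaitNanos(remainingNanos);
>       }
>       return true;
>     } finally {
>       lock.unlock();
>     }
>   }
> }
> {code}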
> The two jstacks below show that the regionserver thread can wait indefinitely 
> both in
> AsyncFSWAL#waitForSafePoint()
> {code:java}
> "regionserver/gh-data-hbase-prophet25.gh.sankuai.com/10.22.129.16:16020" #28 
> prio=5 os_prio=0 tid=0x00007fd7f4716000 nid=0x83a2 waiting on condition 
> [0x00007fd7ec687000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007fddc4a5ac68> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:709)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:793)
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
>         at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
>         at 
> org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> and on AbstractFSWAL.rollWriterLock (when the logRoller is stuck in 
> AsyncFSWAL#waitForSafePoint())
> {code:java}
> "regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020" #28 
> prio=5 os_prio=0 tid=0x00007fafc68e9000 nid=0x2091 waiting on condition 
> [0x00007f9d661ef000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007fa03d52c298> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>         at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:791)
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
>         at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
>         at 
> org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
>         at java.lang.Thread.run(Thread.java:745)
> ...
> "regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020.logRoller"
>  #253 daemon prio=5 os_prio=0 tid=0x00007fafa90a1000 nid=0x20a4 waiting on 
> condition [0x00007f9d649ba000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007fa03d2f18f8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:681)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:685)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:746)
>         at 
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:188)
>         at java.lang.Thread.run(Thread.java:745){code}
>  WAL stuck error logs:
> !wal-stuck-error-logs.png|width=924,height=225!
> HBASE-14790 noted that choosing a timeout setting is a hard problem. But we can 
> still limit the shutdown time of the WAL, which is also reasonable for MTTR.
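> As a minimal sketch of how bounding the WAL shutdown could look, the shutdown can 
> be run on a separate thread and waited on with a configurable timeout. The config 
> key and helper below (e.g. "hbase.wal.shutdown.timeout.ms", 
> shutdownWalWithTimeout) are hypothetical, for illustration only:
> {code:java}
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class BoundedWalShutdownSketch {
>   // Hypothetical config key and default, for illustration only.
>   static final String WAL_SHUTDOWN_TIMEOUT_KEY = "hbase.wal.shutdown.timeout.ms";
>   static final long DEFAULT_WAL_SHUTDOWN_TIMEOUT_MS = 15_000L;
>
>   static void shutdownWalWithTimeout(Runnable walShutdown, long timeoutMs) {
>     ExecutorService executor = Executors.newSingleThreadExecutor();
>     Future<?> future = executor.submit(walShutdown);
>     try {
>       // Bound the wait so a hung sync cannot block RS exit forever.
>       future.get(timeoutMs, TimeUnit.MILLISECONDS);
>     } catch (TimeoutException e) {
>       // Give up on a clean shutdown so the RS can still exit.
>       future.cancel(true);
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>     } catch (ExecutionException e) {
>       // Shutdown failed; log and continue so exit is not blocked.
>     } finally {
>       executor.shutdownNow();
>     }
>   }
> }
> {code}
> With something like this around the call in HRegionServer#shutdownWAL, a stuck 
> waitForSafePoint() would no longer keep the RS process alive indefinitely.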
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
