[
https://issues.apache.org/jira/browse/HBASE-25631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299464#comment-17299464
]
Xiaolin Ha commented on HBASE-25631:
------------------------------------
Hi, [~zhangduo], thanks for paying attention to this issue.
We are currently suffering from WAL-stuck problems. They usually occur when
closing a region or flushing the memstore, because WAL syncs become very slow.
Sometimes the slow syncs are caused by machine hardware failures, but sometimes
they are not, and unfortunately we do not yet know the exact cause.
Some error logs are as follows:
{code:java}
2021-03-10 22:02:27,193 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server rz-data-hbase-yarnlog09.rz.sankuai.com,16020,1611144388339: Replay of WAL required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: LOG_FILES_RZ,873,1504768512247.6e80f0fd4c02202392f0fdb58174d932.
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2714)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2391)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2353)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2243)
    at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2163)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:606)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:567)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:357)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 30000 ms for txid=2157954, WAL system stuck?
    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:132)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:697)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:631)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:151)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:79)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2691)
    ... 9 more
{code}
{code:java}
2021-03-04 14:11:56,508 INFO [RS_CLOSE_REGION-15] regionserver.HRegionServer: STOPPED: Unrecoverable exception while closing region mdata_user_compress_trace,67,1614761298978.0937d710434433166beef6cd4f68df5f., still finishing close
2021-03-04 14:11:56,508 WARN [DescriptorCacheCleaner] conf.DescriptorCacheCleaner: DescriptorCacheCleaner sleep err,
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hbase.conf.DescriptorCacheCleaner.run(DescriptorCacheCleaner.java:45)
2021-03-04 14:11:56,508 INFO [regionserver/zf-data-hbase166.mt/10.56.19.44:16020] regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2021-03-04 14:11:56,508 ERROR [RS_CLOSE_REGION-15] executor.EventHandler: Caught throwable while processing event M_RS_CLOSE_REGION
java.lang.RuntimeException: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 60000 ms for txid=917393320, WAL system stuck?
    at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:152)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 60000 ms for txid=917393320, WAL system stuck?
    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:132)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:697)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:631)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:151)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:130)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:95)
    at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1132)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1745)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1563)
    at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
    ... 4 more
{code}
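Both traces end in the same place: the flush or close path blocks on a WAL sync, gives up after the configured timeout, and the resulting TimeoutIOException forces the regionserver to abort. The following is only a minimal sketch of that pattern in plain Java; the class and method names are made up for illustration and are not the real AsyncFSWAL/SyncFuture code.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified illustration of the failure mode in the traces above.
// None of these names are real HBase classes; they only show the shape of
// "block on a WAL sync with a deadline, abort if it never completes".
public class WalSyncTimeoutSketch {

  static final long SYNC_TIMEOUT_MS = 30_000; // matches the 30000 ms in the first trace

  // Stand-in for the future the WAL hands back for a sync request.
  static CompletableFuture<Void> requestSync(long txid) {
    return new CompletableFuture<>(); // in the stuck case this never completes
  }

  static void flushRegion(long txid) throws Exception {
    CompletableFuture<Void> syncFuture = requestSync(txid);
    try {
      // The flush/close path cannot finish until the marker edit is durable in the WAL.
      syncFuture.get(SYNC_TIMEOUT_MS, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // In the logs this surfaces as TimeoutIOException("... WAL system stuck?"),
      // which the flush path wraps into DroppedSnapshotException and aborts the server.
      throw new Exception("Failed to get sync result after " + SYNC_TIMEOUT_MS
          + " ms for txid=" + txid + ", WAL system stuck?", e);
    }
  }
}
{code}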
> Non-daemon threads make regionserver cannot exit completely
> ------------------------------------------------------------
>
> Key: HBASE-25631
> URL: https://issues.apache.org/jira/browse/HBASE-25631
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 2.0.0, 1.4.7
> Environment:
>
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Attachments: 1614845335795-image.png, 1614857532776-image.png
>
>
> When the regionserver aborts because of some errors, the process cannot exit completely.
> Error logs and a jstack of the regionserver process are as follows:
> !1614857532776-image.png|width=823,height=364!
> * !1614845335795-image.png|width=532,height=330!
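As a minimal, self-contained illustration of the symptom in the issue title (plain Java, assuming nothing about the actual HBase threads involved): any non-daemon thread that keeps running after shutdown keeps the JVM alive, so the regionserver process never exits completely.
{code:java}
// Hypothetical sketch, not HBase code: a leftover non-daemon thread keeps the
// JVM alive after the rest of the process has shut down.
public class NonDaemonExitSketch {
  public static void main(String[] args) {
    Thread worker = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(1_000); // pretend to be a background service thread
        } catch (InterruptedException e) {
          return; // would exit if somebody interrupted it during shutdown
        }
      }
    }, "leftover-worker");
    // worker.setDaemon(true); // with this line the JVM would exit normally
    worker.start();
    System.out.println("main() returns, but the process stays alive");
    // The JVM keeps running because a non-daemon thread is still active.
  }
}
{code}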
--
This message was sent by Atlassian Jira
(v8.3.4#803005)