[
https://issues.apache.org/jira/browse/HBASE-26411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482614#comment-17482614
]
Bryan Beaudreault edited comment on HBASE-26411 at 1/26/22, 4:51 PM:
---------------------------------------------------------------------
I think I've hit a case of this as well. It's not exactly the same, but I have an
aborting regionserver hung waiting on the same call.
The regionserver aborted, but the process stayed up indefinitely. Here are the
stacks:
{code:java}
"regionserver/test2-host:60020.logRoller" #240 daemon prio=5 os_prio=0 cpu=88.15ms elapsed=3809.72s tid=0x00007f7820225000 nid=0x550a waiting on condition [0x00007f77f1edd000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x00000007a0ae1578> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1796)
    at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3128)
    at java.util.concurrent.CompletableFuture.waitingGet([email protected]/CompletableFuture.java:1823)
    at java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:1998)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:189)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:202)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
    at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:669)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:130)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:841)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:268)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:187)
{code}
This thread is holding the rollWriterLock, and meanwhile shutdown is trying to
acquire it:
{code:java}
"regionserver/test2-host:60020" #16 prio=5 os_prio=0 cpu=2870.36ms elapsed=3810.32s tid=0x00007f783e8a8000 nid=0x54fd waiting on condition [0x00007f77f41eb000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x0000000601b10cc0> (a java.util.concurrent.locks.ReentrantLock$FairSync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
    at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:902)
    at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:187)
    at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:240)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1544)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1188)
{code}
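I'm not very familiar with this code, but I have a feeling the problem is that
whatever is supposed to be serving the AsyncProtobufLogWriter.write request has
already been shut down, so the future that the logRoller thread is waiting on is
never completed and the rollWriterLock is never released.
This is on version 2.4.6.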
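To make the shape of the hang concrete, here is a minimal, self-contained sketch. It is not HBase code: the class RollLockSketch, the methods roll/shutdown/demo, and the timeout values are all hypothetical names chosen for illustration. It only assumes what the stacks show, namely that one thread waits on a CompletableFuture while holding a fair ReentrantLock and another thread wants that lock. The sketch uses bounded waits (get with a timeout, tryLock with a timeout) to show how either side could avoid blocking forever if the future is never completed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class RollLockSketch {
  // Stand-in for rollWriterLock (fair, matching the ReentrantLock$FairSync frame).
  static final ReentrantLock ROLL_LOCK = new ReentrantLock(true);

  // Hypothetical roller: waits for the header write while holding the lock,
  // but with a bound, so the lock is released even if the write never finishes.
  static boolean roll(CompletableFuture<Long> headerWrite, long timeoutMs,
      CountDownLatch lockHeld) throws InterruptedException {
    ROLL_LOCK.lock();
    try {
      lockHeld.countDown(); // tell the caller the lock is now held
      headerWrite.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException | ExecutionException e) {
      return false; // write timed out or failed; give up on this roll
    } finally {
      ROLL_LOCK.unlock();
    }
  }

  // Hypothetical shutdown: bounded lock attempt instead of an unbounded lock().
  static boolean shutdown(long timeoutMs) throws InterruptedException {
    if (!ROLL_LOCK.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      return false; // roller still holds the lock
    }
    try {
      return true; // real code would close the writer here
    } finally {
      ROLL_LOCK.unlock();
    }
  }

  // Reproduces the shape of the hang, then shows the bounded wait clearing it.
  // Returns { shutdown result while rolling, shutdown result after the roll timed out }.
  static boolean[] demo() {
    CompletableFuture<Long> neverCompleted = new CompletableFuture<>();
    CountDownLatch lockHeld = new CountDownLatch(1);
    Thread roller = new Thread(() -> {
      try {
        roll(neverCompleted, 1000, lockHeld);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "logRoller-sketch");
    try {
      roller.start();
      lockHeld.await();                    // roller now holds the lock
      boolean whileRolling = shutdown(50); // cannot get the lock yet
      roller.join();                       // roller's bounded wait expires
      boolean afterTimeout = shutdown(50); // lock is free again
      return new boolean[] { whileRolling, afterTimeout };
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] r = demo();
    System.out.println("shutdown while rolling: " + r[0] + ", after timeout: " + r[1]);
  }
}
```

In this sketch the first shutdown attempt fails because the roller still holds the lock, and the second succeeds once the roller's bounded wait has expired. With an unbounded get() and lock(), as in the stacks above, both threads would instead park indefinitely. Whether a timeout is the right fix for HBase itself is a separate question; this only illustrates the blocking pattern.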
Edit: here's the exception that initiated the abort, in case it helps:
{code:java}
2022-01-26 16:00:22,942 [RS_CLOSE_REGION-regionserver/test2-host:60020-2] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region server test2-host,60020,1643209983646: Unrecoverable exception while closing region,@\x00\x00\x00,1642801343755.63f8321cbae0f0880749771882337344. *****
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=3588119, WAL system stuck?
    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:800)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:663)
    at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:622)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:163)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:140)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:105)
    at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1254)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1887)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1620)
    at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:107)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
{code}
The RegionServer was having some GC pain prior to this exception.
> Wal do not roll and write a big wal
> ------------------------------------
>
> Key: HBASE-26411
> URL: https://issues.apache.org/jira/browse/HBASE-26411
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.8
> Reporter: Lijin Bin
> Priority: Major
>
> We see the WAL go a long time without rolling and write one big WAL of 3 TB.
> According to the jstack, the WAL writer creation hangs.
> {code}
> "regionserver/11.149.48.227:60020.logRoller" #667 daemon prio=5 os_prio=0 cpu=116916.81ms elapsed=447455.26s tid=0x00007fa35d231000 nid=0xbdd2 waiting on condition [0x00007f79c7407000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00007f9f10df5158> (a java.util.concurrent.CompletableFuture$Signaller)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:178)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:191)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
>     at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:615)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:126)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:763)
>     at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:184)
>     at java.lang.Thread.run(Thread.java:748)
> {code}