[ 
https://issues.apache.org/jira/browse/HBASE-26411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482614#comment-17482614
 ] 

Bryan Beaudreault edited comment on HBASE-26411 at 1/26/22, 4:51 PM:
---------------------------------------------------------------------

I think I've hit a case of this as well. It's not exactly the same, but I have 
an aborting regionserver hung waiting on this.

A regionserver aborted, but the process stayed up indefinitely. Here are the 
stacks:

 
{code:java}
"regionserver/test2-host:60020.logRoller" #240 daemon prio=5 os_prio=0 
cpu=88.15ms elapsed=3809.72s tid=0x00007f7820225000 nid=0x550a waiting on 
condition  [0x00007f77f1edd000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000007a0ae1578> (a 
java.util.concurrent.CompletableFuture$Signaller)
        at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at 
java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1796)
        at 
java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3128)
        at 
java.util.concurrent.CompletableFuture.waitingGet([email protected]/CompletableFuture.java:1823)
        at 
java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:1998)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:189)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:202)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
        at 
org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:669)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:130)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:841)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:268)
        at 
org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:187) 
{code}
This logRoller thread is holding the rollWriterLock; meanwhile, shutdown is 
trying to acquire it:
{code:java}
"regionserver/test2-host:60020" #16 prio=5 os_prio=0 cpu=2870.36ms 
elapsed=3810.32s tid=0x00007f783e8a8000 nid=0x54fd waiting on condition  
[0x00007f77f41eb000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x0000000601b10cc0> (a 
java.util.concurrent.locks.ReentrantLock$FairSync)
        at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at 
java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:902)
        at 
org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:187)
        at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:240)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1544)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1188) 
{code}
 

I'm not very familiar with this code, but I have a feeling the problem is that 
whatever is supposed to be serving the AsyncProtobufLogWriter.write request has 
already been shut down.
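
To make the interaction between the two stacks concrete, here is a minimal, self-contained sketch of the same pattern with bounded waits on both sides. It is not HBase code and not a proposed patch: RollLockSketch, rollWithBoundedWait, shutdownWithBoundedWait and the timeout values are all hypothetical names and numbers chosen for illustration, and the lock is only a stand-in for rollWriterLock. The point it tries to show is that if the roller waits on its future with a timeout, and shutdown uses tryLock with a timeout, neither thread can park forever the way the threads in the dumps do.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class RollLockSketch {
    // Hypothetical stand-in for rollWriterLock; fair, to match the FairSync frame in the dump.
    static final ReentrantLock rollWriterLock = new ReentrantLock(true);

    // Roller side: hold the lock while waiting on the write, but with a bound
    // instead of the unbounded CompletableFuture.get() seen in the first stack.
    static boolean rollWithBoundedWait(CompletableFuture<Long> headerWrite, long timeoutMs)
            throws InterruptedException {
        rollWriterLock.lock();
        try {
            headerWrite.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // the write timed out or failed; return so the lock is released
        } finally {
            rollWriterLock.unlock();
        }
    }

    // Shutdown side: tryLock with a timeout instead of the unbounded lock()
    // seen in the second stack.
    static boolean shutdownWithBoundedWait(long timeoutMs) throws InterruptedException {
        if (!rollWriterLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // caller can decide to proceed with the abort anyway
        }
        try {
            return true; // real shutdown work would go here
        } finally {
            rollWriterLock.unlock();
        }
    }

    // Returns {lock acquired while the roller was waiting, lock acquired after the roller gave up}.
    static boolean[] demo() {
        // A future nobody will ever complete, like the write that is never served.
        CompletableFuture<Long> neverCompleted = new CompletableFuture<>();
        Thread roller = new Thread(() -> {
            try {
                rollWithBoundedWait(neverCompleted, 1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "logRoller-sketch");
        roller.start();
        try {
            // Wait until the roller thread actually holds the lock.
            while (!rollWriterLock.isLocked()) {
                Thread.sleep(5);
            }
            boolean during = shutdownWithBoundedWait(100);
            roller.join();
            boolean after = shutdownWithBoundedWait(100);
            return new boolean[] {during, after};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("shutdown acquired lock while roller waited: " + r[0]);
        System.out.println("shutdown acquired lock after roller gave up: " + r[1]);
    }
}
```

In this sketch the first shutdown attempt gives up after its timeout instead of hanging, and the second succeeds once the roller's own wait has expired and released the lock. Whether a timeout is the right remedy for the real code, as opposed to failing the pending future when its executor is shut down, is a separate question that the sketch does not answer.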

This is on HBase version 2.4.6.

Edit: here's the exception that initiated the abort, if it helps:
{code:java}
2022-01-26 16:00:22,942 [RS_CLOSE_REGION-regionserver/test2-host:60020-2] ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer: ***** ABORTING region 
server test2-host,60020,1643209983646: Unrecoverable exception while closing 
region,@\x00\x00\x00,1642801343755.63f8321cbae0f0880749771882337344. *****
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
result after 300000 ms for txid=3588119, WAL system stuck?
        at 
org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
        at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:800)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:663)
        at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:622)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:163)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:140)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:105)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1254)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1887)
        at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1620)
        at 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:107)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
{code}
The RegionServer was having some GC pain prior to this exception.


> Wal do not roll and write a big wal 
> ------------------------------------
>
>                 Key: HBASE-26411
>                 URL: https://issues.apache.org/jira/browse/HBASE-26411
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.8
>            Reporter: Lijin Bin
>            Priority: Major
>
> We see the WAL take a long time to roll, and it writes a single big WAL of 
> 3TB. According to the jstack, the WAL creation hangs.
> {code}
> "regionserver/11.149.48.227:60020.logRoller" #667 daemon prio=5 os_prio=0 
> cpu=116916.81ms elapsed=447455.26s tid=0x00007fa35d231000 nid=0xbdd2 waiting 
> on condition [0x00007f79c7407000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f9f10df5158> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>         at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>         at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>         at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:178)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:191)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
>         at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:615)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:126)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:763)
>         at 
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:184)
>         at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)