Bryan Beaudreault created HBASE-26715:
-----------------------------------------
Summary: RegionServer should abort if rollWAL cannot complete in a
timely manner
Key: HBASE-26715
URL: https://issues.apache.org/jira/browse/HBASE-26715
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault
I ran into an issue on HBase 2.4.6 that I think is related to HBASE-26679.
Individual writes block on a SyncFuture that never gets completed. Eventually
(after 5m) the writes time out and fail, but the regionserver stayed hung like
this until I killed it about 14 hours later. While HBASE-26679 may fix the hang
bug, I think we should have additional protection against such zombie states.
In this case I think what happened is that the rollWAL was requested due to
failed appends, but it also hung forever. See the stack trace below:
{code:java}
Thread 240 (regionserver/host:60020.logRoller):
State: WAITING
Blocked count: 38
Waited count: 293
Waiting on java.util.concurrent.CompletableFuture$Signaller@13342c6d
Stack:
java.base/jdk.internal.misc.Unsafe.park(Native Method)
java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.base/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
java.base/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.base/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
app//org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:189)
app//org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(AsyncProtobufLogWriter.java:202)
app//org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(AbstractProtobufLogWriter.java:170)
app//org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(AsyncFSWALProvider.java:113)
app//org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:669)
app//org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(AsyncFSWAL.java:130)
app//org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:841)
app//org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:268)
app//org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:187)
{code}
The WAL roller thread was stuck on this wait seemingly forever, so it was
never able to roll the WAL and get writes working again. I think we should add
a timeout here, and abort the regionserver if a WAL cannot be rolled in a
timely manner.
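
To illustrate the idea, here is a rough sketch of a bounded wait, not a patch against the actual HBase code. The class name, the Abortable stand-in, and the timeout values are all hypothetical; it only shows replacing an unbounded CompletableFuture.get() with the timed overload and escalating to an abort when the wait expires.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWalRoll {
  /** Hypothetical stand-in for whatever abort hook the regionserver exposes. */
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  /**
   * Waits for the future at most timeoutMs milliseconds.
   * Returns true if it completed normally; otherwise returns false,
   * after calling abort on timeout or failure.
   */
  static boolean awaitOrAbort(CompletableFuture<?> future, long timeoutMs, Abortable server) {
    try {
      // Timed get() instead of the unbounded get() seen in the stack trace.
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      server.abort("WAL roll did not complete within " + timeoutMs + " ms", e);
      return false;
    } catch (ExecutionException e) {
      server.abort("WAL roll failed", e.getCause());
      return false;
    } catch (InterruptedException e) {
      // Restore the interrupt flag and let the caller decide what to do next.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    // A future that is never completed simulates the stuck header write.
    CompletableFuture<Void> stuck = new CompletableFuture<>();
    boolean rolled = awaitOrAbort(stuck, 100,
        (why, cause) -> System.out.println("ABORT: " + why));
    System.out.println("rolled=" + rolled);
  }
}
```

In a real change the timeout would presumably come from configuration, and the abort would go through the regionserver's own abort path rather than a lambda.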