[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376684#comment-17376684
 ] 

Michael Stack commented on HBASE-26042:
---------------------------------------

Tried reproducing
{code:java}
2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://XXXX:8020/hbase] wal.AsyncFSWAL: sync failed
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer  {code}
... by killing DNs. There is a complaint about a broken stream, but then everything gets neatly closed up and, after the log roll, all is good again.

Pausing the DN process got me these:
{code:java}
2021-07-07 14:44:38,471 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
java.io.IOException: Timeout(60000ms) waiting for response
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$userEventTriggered$4(FanOutOneBlockAsyncDFSOutput.java:300)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:233)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:98)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.userEventTriggered(FanOutOneBlockAsyncDFSOutput.java:299)
        at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
        at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
        at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
        at org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
        at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:365)
        at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
        at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
        at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
        at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelIdle(IdleStateHandler.java:371)
        at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:504)
        at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:476)
        at org.apache.hbase.thirdparty.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at org.apache.hbase.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 {code}
...not the same as what I want. Here the exception has a nice stack trace and 
is being 'handled'.

Related, on another cluster, where fds were 64k and I'd loaded up a single RS 
w/ hundreds of Regions, I saw this:
{code:java}
2021-07-07 00:59:27,372 WARN org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files {code}
Indeed, lsof confirmed fds were at the limit. This WARN looks a bit like the one I'm trying to deal w/ here, w/ some nice extra info out of netty.

[~huaxiangsun] found CASSANDRA-13649, which has similar-looking WARNs from netty and attributes the WARNs showing up in the logs to the lack of an exception handler on the netty pipeline.

We don't have an exception handler on our server-side pipeline. Adding one that does cleanup might prevent the lockup? (Hard to say for sure when I'm having a hard time reproducing the scenario.)
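
Something like the below is what I have in mind. Rough sketch only, not a tested patch (the class name is made up): a last handler on the server-side pipeline that reacts to an otherwise-unhandled exception by closing the channel, so cleanup happens instead of the exception just falling off the tail of the pipeline as in the netty WARN above.
{code:java}
// Rough, untested sketch (hypothetical class name): a tail handler for the
// server-side pipeline that handles exceptions instead of letting them reach
// the end of the pipeline unhandled.
import org.apache.hbase.thirdparty.io.netty.channel.ChannelHandlerContext;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter;

public class PipelineCleanupHandler extends ChannelInboundHandlerAdapter {
  @Override
  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    // Tear the channel down so the failure propagates promptly (close futures
    // fire, resources are released) rather than leaving the connection hanging.
    ctx.close();
  }
}

// Registered as the last handler when the pipeline is assembled, e.g.:
// ch.pipeline().addLast(new PipelineCleanupHandler());
{code}
Whether a plain close is enough cleanup to avoid the lockup is the open question.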

 

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-26042
>                 URL: https://issues.apache.org/jira/browse/HBASE-26042
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.3.5
>            Reporter: Michael Stack
>            Priority: Major
>         Attachments: js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://XXXX:8020/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer {code}
> ... and the WAL turned dead in the water. Scanners start expiring. RPC prints text versions of requests, complaining of requestsTooSlow. Then we start to see these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we will roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter {code}
> I tried hacking the test to repro the above hang by throwing the same exception in the above test (on linux because epoll is needed to repro), but all just worked.
> Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is stuck, w/o a timeout, trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, line=1707 (Compiled frame)
>  - java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) @bci=16, line=189 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) @bci=9, line=202 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, long) @bci=107, line=170 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) @bci=61, line=113 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) @bci=22, line=651 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append to the WAL w/ flush markers etc., unable to add to the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105 (Interpreted frame)
>  - com.lmax.disruptor.RingBuffer.next() @bci=4, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$stampSequenceIdAndPublishToRingBuffer$1(org.apache.commons.lang3.mutable.MutableLong, com.lmax.disruptor.RingBuffer) @bci=2, line=1031 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$$Lambda$270.run() @bci=8 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable) @bci=36, line=140 (Interpreted frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean, com.lmax.disruptor.RingBuffer) @bci=62, line=1030 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean) @bci=10, line=589 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendMarker(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit) @bci=5, line=1081 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(org.apache.hadoop.hbase.wal.WAL, java.util.NavigableMap, org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl, java.util.Map, boolean) @bci=39, line=161 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(org.apache.hadoop.hbase.wal.WAL, java.util.NavigableMap, org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$FlushDescriptor, boolean, org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl) @bci=13, line=89 (Compiled frame)
>  {code}
>  
> Interesting is how more than one thread is seemingly able to be inside the synchronized block in mvcc#begin....
> In



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
