[
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373103#comment-17373103
]
Michael Stack commented on HBASE-26042:
---------------------------------------
Some background notes:
On this cluster, the hang is unusual. What we see more often are 'clean' RS
aborts like the one below:
{code:java}
2021-06-26 12:18:11,725 ERROR [regionserver/XYZ:16020.logRoller]
regionserver.HRegionServer: ***** ABORTING region server
XYZ,16020,1622749552385: IOE in log roller *****
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer {code}
Looking in the DN logs around this time, it's this:
{code:java}
2021-06-26 12:18:18,210 ERROR [DataXceiver for client
DFSClient_NONMAPREDUCE_588272444_1 at /A.B.C.D:28343 [Receiving block
BP-1200783253-A.B.C.D-1581116871410:blk_3413027507_2339403003]]
datanode.DataNode: A.B.C.D:9866:DataXceiver error processing WRITE_BLOCK
operation src: /A.B.C.D:28343 dst: /A.B.C.D:9866
java.lang.NullPointerException
at
sun.nio.ch.EPollArrayWrapper.isEventsHighKilled(EPollArrayWrapper.java:174)
{code}
(In this particular case the DN restarted, hence the 'Connection reset by
peer', but the NPE doesn't always cause a DN restart.)
Poking around, it seems an NPE in isEventsHighKilled means the DN doesn't have
enough fds (and/or a JDK bug). The limit is 16k; it's running about 2k fds when
idle. Will try upping the fd count.
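As a quick sanity check, here is a minimal sketch (not DN code; the class name
and the 75% threshold are made up for illustration) of reading a JVM's open-fd
count against its ulimit via the com.sun.management MXBean, which is one way to
confirm whether the DN really is running low on descriptors:
{code:java}
import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsageCheck {
  public static void main(String[] args) {
    // Only Unix JVMs expose fd counts through this MXBean.
    Object os = ManagementFactory.getOperatingSystemMXBean();
    if (!(os instanceof UnixOperatingSystemMXBean)) {
      System.out.println("fd counts not available on this platform");
      return;
    }
    UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
    long open = unix.getOpenFileDescriptorCount();
    long max = unix.getMaxFileDescriptorCount();
    System.out.printf("open fds: %d / %d (%.1f%%)%n", open, max, 100.0 * open / max);
    // Arbitrary threshold: warn when within 25% of the ulimit.
    if (open > max * 0.75) {
      System.out.println("WARNING: approaching fd limit; selectors/sockets may start failing");
    }
  }
}
{code}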
The DN NPE (or DN abort) looks to cause the RS to abort if it hits the RS
around a log roll. Elsewhere, perhaps, it causes the hang; hard to tell for
sure.
Here is another example of an abort w/ a corresponding DN NPE complaint in
isEventsHighKilled.
{code:java}
2021-06-25 19:21:12,921 INFO [DataStreamer for file
/hbase/data/default/xxx/3ab32c85a5c46f466dc21ecbcff53f6f/.tmp/f/81484a889b0e4c11bfcc71736a023d29]
hdfs.DataStreamer: Exception in createBlockOutputStream
blk_3406122337_2332497742
java.io.EOFException: Unexpected EOF while trying to read response from server
at
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:539)
at
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1762)
at
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
2021-06-25 19:21:12,922 WARN [DataStreamer for file
/hbase/data/default/xxx/3ab32c85a5c46f466dc21ecbcff53f6f/.tmp/f/81484a889b0e4c11bfcc71736a023d29]
hdfs.DataStreamer: Abandoning
BP-1200783253-A.B.C.240-1581116871410:blk_3406122337_2332497742
2021-06-25 19:21:12,924 WARN [DataStreamer for file
/hbase/data/default/xxx/3ab32c85a5c46f466dc21ecbcff53f6f/.tmp/f/81484a889b0e4c11bfcc71736a023d29]
hdfs.DataStreamer: Excluding datanode
DatanodeInfoWithStorage[17.58.122.105:9866,DS-1015be7f-b986-4853-b56f-280ac2e8db4a,DISK]
2021-06-25 19:21:12,924 INFO [DataStreamer for file
/hbase/data/default/xxx/3ab32c85a5c46f466dc21ecbcff53f6f/.tmp/f/81484a889b0e4c11bfcc71736a023d29]
hdfs.DataStreamer: Removing node
DatanodeInfoWithStorage[17.58.122.105:9866,DS-1015be7f-b986-4853-b56f-280ac2e8db4a,DISK]
from the excluded nodes list
2021-06-25 19:21:13,493 INFO [regionserver/ps1586:16020.leaseChecker]
regionserver.RSRpcServices: Scanner -2032258505885651566 lease expired on
region
xxx,\x9D\xDC\x00\x06Nl\xC5\xDEH\xCB\xF2\xE9\xC9J\x05-,1615374707984.0857141d2bb68aa1acb5f543f2bb78bd.
2021-06-25 19:21:14,189 ERROR [regionserver/ps1586:16020.logRoller]
regionserver.HRegionServer: ***** ABORTING region server
ps1586.a.b.c.d,16020,1622767149688: IOE in log roller *****
java.io.IOException: Connection to 17.58.122.105/17.58.122.105:9866 closed
at
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$channelInactive$2(FanOutOneBlockAsyncDFSOutput.java:286)
at
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:233)
at
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:98)
at
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.channelInactive(FanOutOneBlockAsyncDFSOutput.java:285)
{code}
> WAL lockup on 'sync failed'
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
> readAddress(..) failed: Connection reset by peer
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.3.5
> Reporter: Michael Stack
> Priority: Major
> Attachments: js1, js2
>
>
> Making note of an issue seen in a production cluster.
> The node had been struggling under load for a few days, with slow syncs of up
> to 10 seconds, a few STUCK MVCCs from which it recovered, and some Java pauses
> up to three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://XXXX:8020/hbase]
> wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
> readAddress(..) failed: Connection reset by peer {code}
> ... and the WAL turned dead in the water. Scanners start expiring. RPC prints
> text versions of requests, complaining requestsTooSlow. Then we start to see
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync
> result after 300000 ms for txid=552128301, WAL system stuck? {code}
> What's supposed to happen when the other side goes away like this is that we
> roll the WAL, i.e. go set up a new one. You can see it happening if you run
> {code:java}
> mvn test
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
> {code}
> I tried hacking the test to repro the above hang by throwing the same
> exception in the test (on Linux, because epoll is needed to repro), but it all
> just worked.
> Thread dumps of the hung WAL subsystem are a little odd. The log roller is
> stuck, with no timeout, trying to write out the WAL header:
>
> {code:java}
> Thread 9464: (state = BLOCKED)
> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
> may be imprecise)
> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
> line=175 (Compiled frame)
> - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19,
> line=1707 (Compiled frame)
> -
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
> @bci=119, line=3323 (Compiled frame)
> - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115,
> line=1742 (Compiled frame)
> - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled
> frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
> @bci=16, line=189 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
> org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader)
> @bci=9, line=202 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean,
> long) @bci=107, line=170 (Compiled frame)
> -
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
> org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long,
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class)
> @bci=61, line=113 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
> @bci=22, line=651 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
> @bci=2, line=128 (Compiled frame)
> - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean)
> @bci=101, line=797 (Compiled frame)
> - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long)
> @bci=18, line=263 (Compiled frame)
> - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179
> (Compiled frame) {code}
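>
> A minimal sketch (not HBase code; the names are made up) of why that frame can
> park forever: CompletableFuture.get() with no arguments has no deadline, so if
> the output pipeline never completes the future after the connection reset, the
> roller blocks indefinitely, whereas a timed get() would at least surface the
> hang:
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class HeaderWriteWait {
>   public static void main(String[] args) throws Exception {
>     // Stands in for the future the header write hands to the output pipeline;
>     // after a 'Connection reset by peer' nobody ever completes it.
>     CompletableFuture<Long> lengthAfterWrite = new CompletableFuture<>();
>
>     // What the stuck frame is doing: an unbounded wait.
>     // lengthAfterWrite.get(); // parks here forever
>
>     // A bounded wait turns the silent hang into a visible failure.
>     try {
>       long len = lengthAfterWrite.get(10, TimeUnit.SECONDS);
>       System.out.println("header written, length=" + len);
>     } catch (TimeoutException e) {
>       System.out.println("WAL header write timed out; writer is likely dead");
>     }
>   }
> }
> {code}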
>
> Other threads are BLOCKED trying to append flush markers etc. to the WAL,
> unable to get a slot on the ring buffer:
>
> {code:java}
> Thread 9465: (state = BLOCKED)
> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information
> may be imprecise)
> - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338
> (Compiled frame)
> - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136
> (Compiled frame)
> - com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105
> (Interpreted frame)
> - com.lmax.disruptor.RingBuffer.next() @bci=4, line=263 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$stampSequenceIdAndPublishToRingBuffer$1(org.apache.commons.lang3.mutable.MutableLong,
> com.lmax.disruptor.RingBuffer) @bci=2, line=1031 (Compiled frame)
> - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$$Lambda$270.run()
> @bci=8 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable)
> @bci=36, line=140 (Interpreted frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo,
> org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit,
> boolean, com.lmax.disruptor.RingBuffer) @bci=62, line=1030 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo,
> org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit,
> boolean) @bci=10, line=589 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendMarker(org.apache.hadoop.hbase.client.RegionInfo,
> org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit)
> @bci=5, line=1081 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(org.apache.hadoop.hbase.wal.WAL,
> java.util.NavigableMap, org.apache.hadoop.hbase.client.RegionInfo,
> org.apache.hadoop.hbase.wal.WALEdit,
> org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl,
> java.util.Map, boolean) @bci=39, line=161 (Compiled frame)
> -
> org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(org.apache.hadoop.hbase.wal.WAL,
> java.util.NavigableMap, org.apache.hadoop.hbase.client.RegionInfo,
> org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$FlushDescriptor,
> boolean, org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl)
> @bci=13, line=89 (Compiled frame)
> {code}
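>
> A minimal Disruptor sketch (illustrative only, not HBase code; names made up)
> of the parking seen above: with a multi-producer RingBuffer whose gating
> Sequence never advances (i.e. the consumer is wedged), next() parks inside
> MultiProducerSequencer once the ring is full, just like the appenders here:
> {code:java}
> import com.lmax.disruptor.RingBuffer;
> import com.lmax.disruptor.Sequence;
>
> public class StalledRingBufferDemo {
>   // Trivial event type for the demo.
>   static final class LongEvent {
>     long value;
>   }
>
>   public static void main(String[] args) {
>     // Small multi-producer ring; the WAL append path above is multi-producer too.
>     RingBuffer<LongEvent> ring = RingBuffer.createMultiProducer(LongEvent::new, 8);
>
>     // A gating sequence that never moves: stands in for a stuck consumer.
>     ring.addGatingSequences(new Sequence(-1));
>
>     for (int i = 0; i < 9; i++) {
>       System.out.println("claiming slot " + i);
>       long seq = ring.next(); // the 9th call parks in MultiProducerSequencer.next()
>       ring.get(seq).value = i;
>       ring.publish(seq);
>     }
>     System.out.println("never reached while the consumer is stalled");
>   }
> }
> {code}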
>
> Interesting is how more than one thread seemingly is able to be inside the
> synchronized block in mvcc#begin....
> In
--
This message was sent by Atlassian Jira
(v8.3.4#803005)