[
https://issues.apache.org/jira/browse/HBASE-28584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866571#comment-17866571
]
Andrew Kyle Purtell edited comment on HBASE-28584 at 7/17/24 12:37 AM:
-----------------------------------------------------------------------
Rug-pulling the sink-side replicator
So, after realizing we are crashing while iterating the cell scanner, there is
a simple one-line mitigation for this problem, attached as
[^0001-Deep-clone-cells-set-to-be-replicated-onto-the-local.patch]. With it
applied I can no longer reproduce SEGV crashes in the scenario that previously
triggered them reliably.
The remaining work here is to trace how we come to call release() prematurely
on the buffer that backs the cell scanner.
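For reference, a minimal sketch of the idea behind the mitigation, using only
public HBase API; the class and method names are illustrative, this is not the
attached patch:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.KeyValueUtil;

public final class DeepCloneSketch {
  /**
   * Materialize every cell from the RPC-owned CellScanner into fresh
   * on-heap KeyValues, so nothing downstream keeps referencing the
   * refcounted request buffer after the call is cleaned up.
   */
  static List<Cell> deepCloneCells(CellScanner cells) throws IOException {
    List<Cell> copies = new ArrayList<>();
    while (cells.advance()) {
      // copyToNewKeyValue copies the cell into a brand-new byte[],
      // severing the tie to the (possibly direct, poolable) backing buffer.
      copies.add(KeyValueUtil.copyToNewKeyValue(cells.current()));
    }
    return copies;
  }
}
{code}
Once the cells are copied on-heap, the lifetime of the request buffer no
longer matters to the sink, whatever the underlying release bug turns out to
be.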
was (Author: apurtell):
Rug-pulling the sink-side replicator
So, after realizing we are crashing while iterating the cell scanner, there is
a simple one-line mitigation for this problem, attached as
[^0001-Deep-clone-cells-set-to-be-replicated-onto-the-local.patch]. With it
applied I can no longer reproduce SEGV crashes in the scenario that previously
triggered them reliably.
The remaining work here is to trace how we come to release() the buffers that
back the cell scanner. My guess: an exception thrown while processing one
row batch propagates back to the server RPC handler, which cleans up the call
and releases the buffer; under memory pressure that buffer is quickly
recycled, while in parallel another thread is still working on one of the
other row batches derived from the same replication RPC whose resources we
just released.
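To make that suspected sequence concrete, here is a minimal, self-contained
sketch of the race, using plain Netty rather than the hbase-thirdparty
shading, with illustrative names throughout; it models the hypothesis above
and is not code from HBase:
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class ReleaseRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    final ByteBuf request = PooledByteBufAllocator.DEFAULT.directBuffer(1 << 20);
    request.writeZero(request.capacity());

    // Stand-in for a handler thread still iterating a CellScanner whose
    // cells point into the request buffer.
    Thread reader = new Thread(() -> {
      try {
        for (int i = 0; i < request.capacity(); i++) {
          request.getByte(i);
        }
      } catch (Exception e) {
        // With io.netty.buffer.checkAccessible=true (the default), a read
        // after refCnt reaches 0 typically fails fast like this instead.
        System.err.println("fail-fast instead of SIGSEGV: " + e);
      }
    });
    reader.start();

    // Stand-in for RPC cleanup after an exception: the call's buffer is
    // released while the reader is still running. Once refCnt hits 0 the
    // pooled arena may recycle the memory; with accessibility checks
    // disabled for performance, the reader then walks recycled (or even
    // unmapped) memory -- a native crash, not a Java exception.
    request.release();
    reader.join();
  }
}
{code}
If that is what is happening, the faulting frame below
(ByteBufferUtils.copyBufferToStream walking a ByteBuffer that no longer owns
its memory) is exactly the signature you would expect.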
> RS SIGSEGV under heavy replication load
> ---------------------------------------
>
> Key: HBASE-28584
> URL: https://issues.apache.org/jira/browse/HBASE-28584
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 2.5.6
> Environment: RHEL 7.9
> JDK 11.0.23
> Hadoop 3.2.4
> HBase 2.5.6
> Reporter: Whitney Jackson
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Attachments:
> 0001-Deep-clone-cells-set-to-be-replicated-onto-the-local.patch,
> 0001-Support-configuration-based-selection-of-netty-chann.patch
>
>
> I'm observing RS crashes under heavy replication load:
>
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f7546873b69, pid=29890, tid=36828
> #
> # JRE version: Java(TM) SE Runtime Environment 18.9 (11.0.23+7) (build
> 11.0.23+7-LTS-222)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM 18.9 (11.0.23+7-LTS-222, mixed
> mode, tiered, compressed oops, g1 gc, linux-amd64)
> # Problematic frame:
> # J 24625 c2
> org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
> (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> {code}
>
> The heavier load comes when a replication peer has been disabled for several
> hours, e.g. for patching. When the peer is re-enabled, the replication load
> is high until the peer has caught up. The crashes happen on the cluster
> receiving the replication edits.
>
> I believe this problem started after upgrading from 2.4.x to 2.5.x.
>
> One possibly relevant non-standard config I run with:
> {code:xml}
> <property>
>   <name>hbase.region.store.parallel.put.limit</name>
>   <!-- Default: 10 -->
>   <value>100</value>
>   <description>Added after seeing "failed to accept edits" replication errors
>   in the destination region servers indicating this limit was being exceeded
>   while trying to process replication edits.</description>
> </property>
> {code}
>
> I understand from other Jiras that the problem is likely around direct memory
> usage by Netty. I haven't yet tried switching the Netty allocator to
> {{unpooled}} or {{heap}}. I also haven't yet tried any of the
> {{io.netty.allocator.*}} options.
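>
> If I do get to trying the allocator switch, I believe the knob on 2.5.x is
> {{hbase.netty.rpcserver.allocator}} (property name from memory, so please
> verify it against the release before relying on it), along these lines:
> {code:xml}
> <property>
>   <name>hbase.netty.rpcserver.allocator</name>
>   <!-- Default: pooled. "heap" would keep RPC buffers on the Java heap,
>        taking direct memory out of the picture at some throughput cost. -->
>   <value>heap</value>
> </property>
> {code}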
>
> {{MaxDirectMemorySize}} is set to 26g.
>
> Here's the full stack for the relevant thread:
>
> {code:java}
> Stack: [0x00007f72e2e5f000,0x00007f72e2f60000], sp=0x00007f72e2f5e450, free
> space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> code)
> J 24625 c2
> org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
> (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> J 26253 c2
> org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I
> (21 bytes) @ 0x00007f7545af2d84 [0x00007f7545af2d20+0x0000000000000064]
> J 22971 c2
> org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V
> (27 bytes) @ 0x00007f754663f700 [0x00007f754663f4c0+0x0000000000000240]
> J 25251 c2
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
> (90 bytes) @ 0x00007f7546a53038 [0x00007f7546a50e60+0x00000000000021d8]
> J 21182 c2
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
> (73 bytes) @ 0x00007f7545f4d90c [0x00007f7545f4d3a0+0x000000000000056c]
> J 21181 c2
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
> (149 bytes) @ 0x00007f7545fd680c [0x00007f7545fd65e0+0x000000000000022c]
> J 25389 c2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$$Lambda$247.run()V
> (16 bytes) @ 0x00007f7546ade660 [0x00007f7546ade140+0x0000000000000520]
> J 24098 c2
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z
> (109 bytes) @ 0x00007f754678fbb8 [0x00007f754678f8e0+0x00000000000002d8]
> J 27297% c2
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (603
> bytes) @ 0x00007f75466c4d48 [0x00007f75466c4c80+0x00000000000000c8]
> j
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> J 12278 c1 java.lang.Thread.run()V [email protected] (17 bytes) @
> 0x00007f753e11f084 [0x00007f753e11ef40+0x0000000000000144]
> v ~StubRoutines::call_stub
> V [libjvm.so+0x85574a] JavaCalls::call_helper(JavaValue*, methodHandle
> const&, JavaCallArguments*, Thread*)+0x27a
> V [libjvm.so+0x853d2e] JavaCalls::call_virtual(JavaValue*, Handle, Klass*,
> Symbol*, Symbol*, Thread*)+0x19e
> V [libjvm.so+0x8ffddf] thread_entry(JavaThread*, Thread*)+0x9f
> V [libjvm.so+0xdb68d1] JavaThread::thread_main_inner()+0x131
> V [libjvm.so+0xdb2c4c] Thread::call_run()+0x13c
> V [libjvm.so+0xc1f2e6] thread_native_entry(Thread*)+0xe6
> {code}
>
>
>