[
https://issues.apache.org/jira/browse/HBASE-28584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866573#comment-17866573
]
Andrew Kyle Purtell edited comment on HBASE-28584 at 7/17/24 12:36 AM:
-----------------------------------------------------------------------
As a side effect of chasing down this issue I have another patch which may be
generally useful if someone wants to configure whether the netty client uses
the pooled or unpooled allocator. Let me drop it here as
[^0001-Support-configuration-based-selection-of-netty-chann.patch] for now.
was (Author: apurtell):
[^0001-Support-configuration-based-selection-of-netty-chann.patch] As a side
effect of chasing down this issue I have another patch which may be generally
useful, if someone wants to be able to configure whether the netty client uses
the pooled or unpooled allocator. Let me drop it here as
[^0001-Deep-clone-cells-set-to-be-replicated-onto-the-local.patch] for now.
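A sketch of what such a configuration could look like in hbase-site.xml. The property name below is hypothetical, chosen by analogy with the existing server-side setting {{hbase.netty.rpcserver.allocator}} (which accepts {{pooled}}, {{unpooled}}, or {{heap}}); the attached patch may well use a different name:

{code:xml}
<property>
  <!-- Hypothetical client-side analogue of hbase.netty.rpcserver.allocator;
       check the attached patch for the actual property name and values. -->
  <name>hbase.netty.client.allocator</name>
  <value>unpooled</value>
</property>
{code}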
> RS SIGSEGV under heavy replication load
> ---------------------------------------
>
> Key: HBASE-28584
> URL: https://issues.apache.org/jira/browse/HBASE-28584
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 2.5.6
> Environment: RHEL 7.9
> JDK 11.0.23
> Hadoop 3.2.4
> HBase 2.5.6
> Reporter: Whitney Jackson
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Attachments:
> 0001-Deep-clone-cells-set-to-be-replicated-onto-the-local.patch,
> 0001-Support-configuration-based-selection-of-netty-chann.patch
>
>
> I'm observing RS crashes under heavy replication load:
>
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f7546873b69, pid=29890, tid=36828
> #
> # JRE version: Java(TM) SE Runtime Environment 18.9 (11.0.23+7) (build
> 11.0.23+7-LTS-222)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM 18.9 (11.0.23+7-LTS-222, mixed
> mode, tiered, compressed oops, g1 gc, linux-amd64)
> # Problematic frame:
> # J 24625 c2
> org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
> (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> {code}
>
> The heavier load comes when a replication peer has been disabled for several
> hours for patching, etc. When the peer is re-enabled, the replication load is
> high until the peer is all caught up. The crashes happen on the cluster
> receiving the replication edits.
>
> I believe this problem started after upgrading from 2.4.x to 2.5.x.
>
> One possibly relevant non-standard config I run with:
> {code:java}
> <property>
>   <name>hbase.region.store.parallel.put.limit</name>
>   <!-- Default: 10 -->
>   <value>100</value>
>   <description>Added after seeing "failed to accept edits" replication errors
>   in the destination region servers indicating this limit was being exceeded
>   while trying to process replication edits.</description>
> </property>
> {code}
>
> I understand from other Jiras that the problem is likely around direct memory
> usage by Netty. I haven't yet tried switching the Netty allocator to
> {{unpooled}} or {{heap}}. I also haven't yet tried any of the
> {{io.netty.allocator.*}} options.
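> A hedged sketch of those knobs: stock netty reads the system property
> {{io.netty.allocator.type}} ({{pooled}} or {{unpooled}}) to pick its default
> allocator. HBase bundles a shaded netty, so the relocated property prefix
> below is an assumption to verify against the build; e.g. in hbase-env.sh:
>
> {code:java}
> # Sketch only: with unshaded netty the property is io.netty.allocator.type
> export HBASE_OPTS="$HBASE_OPTS -Dorg.apache.hbase.thirdparty.io.netty.allocator.type=unpooled"
> {code}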
>
> {{MaxDirectMemorySize}} is set to 26g.
>
> Here's the full stack for the relevant thread:
>
> {code:java}
> Stack: [0x00007f72e2e5f000,0x00007f72e2f60000], sp=0x00007f72e2f5e450, free
> space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> code)
> J 24625 c2
> org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
> (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> J 26253 c2
> org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I
> (21 bytes) @ 0x00007f7545af2d84 [0x00007f7545af2d20+0x0000000000000064]
> J 22971 c2
> org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V
> (27 bytes) @ 0x00007f754663f700 [0x00007f754663f4c0+0x0000000000000240]
> J 25251 c2
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
> (90 bytes) @ 0x00007f7546a53038 [0x00007f7546a50e60+0x00000000000021d8]
> J 21182 c2
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
> (73 bytes) @ 0x00007f7545f4d90c [0x00007f7545f4d3a0+0x000000000000056c]
> J 21181 c2
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
> (149 bytes) @ 0x00007f7545fd680c [0x00007f7545fd65e0+0x000000000000022c]
> J 25389 c2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$$Lambda$247.run()V
> (16 bytes) @ 0x00007f7546ade660 [0x00007f7546ade140+0x0000000000000520]
> J 24098 c2
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z
> (109 bytes) @ 0x00007f754678fbb8 [0x00007f754678f8e0+0x00000000000002d8]
> J 27297% c2
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (603
> bytes) @ 0x00007f75466c4d48 [0x00007f75466c4c80+0x00000000000000c8]
> j
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> J 12278 c1 java.lang.Thread.run()V [email protected] (17 bytes) @
> 0x00007f753e11f084 [0x00007f753e11ef40+0x0000000000000144]
> v ~StubRoutines::call_stub
> V [libjvm.so+0x85574a] JavaCalls::call_helper(JavaValue*, methodHandle
> const&, JavaCallArguments*, Thread*)+0x27a
> V [libjvm.so+0x853d2e] JavaCalls::call_virtual(JavaValue*, Handle, Klass*,
> Symbol*, Symbol*, Thread*)+0x19e
> V [libjvm.so+0x8ffddf] thread_entry(JavaThread*, Thread*)+0x9f
> V [libjvm.so+0xdb68d1] JavaThread::thread_main_inner()+0x131
> V [libjvm.so+0xdb2c4c] Thread::call_run()+0x13c
> V [libjvm.so+0xc1f2e6] thread_native_entry(Thread*)+0xe6
> {code}
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)