[ https://issues.apache.org/jira/browse/HBASE-28584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849365#comment-17849365 ]
Andrew Kyle Purtell commented on HBASE-28584:
---------------------------------------------

In our crashes we have observed that the frame triggering the SIGSEGV or SIGBUS is native memcmp or memcpy, one of the stubs our use of Unsafe leads to, so I have been focused, perhaps unnecessarily, on adding assertions to bring back bounds checking for debugging in test environments. (Our use of Unsafe deliberately bypasses the usual Java language and memory model safeties for performance.) However, the frame in this report has no use of Unsafe:

{noformat}
# Problematic frame:
# J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
{noformat}

copyBufferToStream is straight Java code using standard ByteBuffer APIs. Perhaps the issue is entirely due to Netty's use of direct buffers, and Netty's requirement that we get reference counting correct in all circumstances. In that regard, I have been wondering whether, under load and allocation pressure, we are hitting race conditions we have always had but that do not manifest except under a lot of pressure. One potential case: we accidentally release a buffer that is still in use. In this scenario the reader of the released buffer would normally have enough time to complete, but under high load and allocation pressure the buffer would be quickly recycled, and the reader would end up racing with either the reclamation or a new writer. So we should still debug with assertions (Preconditions) to try to catch this, but they should focus on asserting the validity, shape, and ownership of the direct buffer somehow.
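To make the assertion idea above concrete, here is a minimal, self-contained sketch in plain JDK code, not the actual Netty/HBase types; the names RefCountedBuffer and checkAccessible are hypothetical. It shows the kind of validity check a Preconditions-style guard would enforce before every read of a reference-counted direct buffer:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for a Netty-style reference-counted direct buffer.
// Illustrates failing fast on use-after-release instead of letting the
// read race with reclamation or a new writer.
class RefCountedBuffer {
    private final ByteBuffer buf;
    private final AtomicInteger refCnt = new AtomicInteger(1);

    RefCountedBuffer(int capacity) {
        this.buf = ByteBuffer.allocateDirect(capacity);
    }

    // The proposed assertion: a released buffer (refCnt == 0) may already
    // have been recycled and be owned by another writer, so reject the read.
    private void checkAccessible() {
        int cnt = refCnt.get();
        if (cnt <= 0) {
            throw new IllegalStateException("refCnt: " + cnt + " (buffer already released)");
        }
    }

    void retain() {
        if (refCnt.getAndIncrement() <= 0) {
            refCnt.getAndDecrement();
            throw new IllegalStateException("retain on a released buffer");
        }
    }

    /** @return true when this release dropped the last reference. */
    boolean release() {
        return refCnt.decrementAndGet() == 0;
    }

    byte get(int index) {
        checkAccessible();  // deterministic failure instead of a SIGSEGV
        return buf.get(index);
    }

    int refCnt() {
        return refCnt.get();
    }
}
```

One limitation worth noting: once a released buffer has been recycled and re-retained by a new owner, its refCnt is positive again, so a check like this only catches reads in the window before reuse; it cannot detect the race after recycling.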
> RS SIGSEGV under heavy replication load
> ---------------------------------------
>
>                 Key: HBASE-28584
>                 URL: https://issues.apache.org/jira/browse/HBASE-28584
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 2.5.6
>         Environment: RHEL 7.9
> JDK 11.0.23
> Hadoop 3.2.4
> HBase 2.5.6
>            Reporter: Whitney Jackson
>            Priority: Major
>
> I'm observing RS crashes under heavy replication load:
>
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f7546873b69, pid=29890, tid=36828
> #
> # JRE version: Java(TM) SE Runtime Environment 18.9 (11.0.23+7) (build 11.0.23+7-LTS-222)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM 18.9 (11.0.23+7-LTS-222, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
> # Problematic frame:
> # J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> {code}
>
> The heavier load comes when a replication peer has been disabled for several hours for patching etc. When the peer is re-enabled the replication load is high until the peer is all caught up. The crashes happen on the cluster receiving the replication edits.
>
> I believe this problem started after upgrading from 2.4.x to 2.5.x.
>
> One possibly relevant non-standard config I run with:
>
> {code:java}
> <property>
>   <name>hbase.region.store.parallel.put.limit</name>
>   <!-- Default: 10 -->
>   <value>100</value>
>   <description>Added after seeing "failed to accept edits" replication errors in the destination region servers indicating this limit was being exceeded while trying to process replication edits.</description>
> </property>
> {code}
>
> I understand from other Jiras that the problem is likely around direct memory usage by Netty. I haven't yet tried switching the Netty allocator to {{unpooled}} or {{heap}}.
> I also haven't yet tried any of the {{io.netty.allocator.*}} options.
>
> {{MaxDirectMemorySize}} is set to 26g.
>
> Here's the full stack for the relevant thread:
>
> {code:java}
> Stack: [0x00007f72e2e5f000,0x00007f72e2f60000], sp=0x00007f72e2f5e450, free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> J 26253 c2 org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 bytes) @ 0x00007f7545af2d84 [0x00007f7545af2d20+0x0000000000000064]
> J 22971 c2 org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V (27 bytes) @ 0x00007f754663f700 [0x00007f754663f4c0+0x0000000000000240]
> J 25251 c2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (90 bytes) @ 0x00007f7546a53038 [0x00007f7546a50e60+0x00000000000021d8]
> J 21182 c2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (73 bytes) @ 0x00007f7545f4d90c [0x00007f7545f4d3a0+0x000000000000056c]
> J 21181 c2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007f7545fd680c [0x00007f7545fd65e0+0x000000000000022c]
> J 25389 c2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$$Lambda$247.run()V (16 bytes) @ 0x00007f7546ade660 [0x00007f7546ade140+0x0000000000000520]
> J 24098 c2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (109 bytes) @ 0x00007f754678fbb8 [0x00007f754678f8e0+0x00000000000002d8]
> J 27297% c2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (603 bytes) @ 0x00007f75466c4d48 [0x00007f75466c4c80+0x00000000000000c8]
> j org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> J 12278 c1 java.lang.Thread.run()V java.base@11.0.23 (17 bytes) @ 0x00007f753e11f084 [0x00007f753e11ef40+0x0000000000000144]
> v ~StubRoutines::call_stub
> V [libjvm.so+0x85574a] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x27a
> V [libjvm.so+0x853d2e] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x19e
> V [libjvm.so+0x8ffddf] thread_entry(JavaThread*, Thread*)+0x9f
> V [libjvm.so+0xdb68d1] JavaThread::thread_main_inner()+0x131
> V [libjvm.so+0xdb2c4c] Thread::call_run()+0x13c
> V [libjvm.so+0xc1f2e6] thread_native_entry(Thread*)+0xe6
> {code}
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)
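A note for anyone wanting to experiment with the allocator settings mentioned in the report: they are typically passed as RegionServer JVM system properties in hbase-env.sh. The sketch below rests on assumptions to verify against your build: the org.apache.hbase.thirdparty prefix assumes HBase's shaded Netty relocates the stock io.netty.* property names, and the specific tunables shown (numDirectArenas, noPreferDirect) are upstream Netty properties.

```shell
# Sketch only -- verify property names against your HBase/Netty versions.
# Cap direct memory as in the report:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=26g"

# Upstream Netty allocator tunables; under HBase's shaded Netty the names
# are assumed to carry the thirdparty relocation prefix:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Dorg.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas=0 \
  -Dorg.apache.hbase.thirdparty.io.netty.noPreferDirect=true"
```

Setting numDirectArenas=0 steers the pooled allocator away from pooled direct arenas, and noPreferDirect=true makes allocators prefer heap buffers, which approximates the {{unpooled}}/{{heap}} experiments described above.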